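HTTP requests
The Fetcher class is built on the high-performance curl_cffi library and provides fast, lightweight HTTP requests with many stealth features.
You will usually import the Fetcher as shown here; every fetcher is imported the same way.

    from scrapling.fetchers import Fetcher

See the documentation on configuring parsing options for how to set them.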
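All request methods here share a set of parameters, so those are covered first.
- url: The target URL.
- stealthy_headers: When enabled (the default), generates and attaches real browser headers. It also sets a Google referer header.
- follow_redirects: Controls redirect behavior. The default is "safe", which follows redirects but refuses to redirect to internal / private IPs (SSRF protection). Pass True to follow all redirects without restriction, or False to disable redirects entirely.
- timeout: Number of seconds to wait for each request to complete. Defaults to 30 seconds.
- retries: Number of times the fetcher retries a failed request. Defaults to 3.
- retry_delay: Number of seconds to wait between retries. Defaults to 1 second.
- impersonate: Impersonates a specific browser's TLS fingerprint. Accepts a browser string or a list of strings: "chrome110", "firefox102", or "safari15_5" select a specific version, while "chrome", "firefox", "safari", or "edge" automatically use the latest available version. This makes your requests look like they come from a real browser at the TLS level. If a list of strings is passed, one is chosen at random for each request. Defaults to the latest available Chrome version.
- http3: Use the HTTP/3 protocol for requests. Defaults to False. It may cause problems when combined with impersonate.
- cookies: Cookies to use in the request. Either a dictionary of name→value pairs or a list of dictionaries.
- proxy: As the name suggests, the proxy for this request, used to route all traffic (HTTP and HTTPS). Example of a supported format: http://username:***@localhost:8030.
- proxy_auth: HTTP Basic Auth for the proxy, as a (username, password) tuple.
- proxies: A dictionary of proxies in the format {"http": proxy_url, "https": proxy_url}.
- proxy_rotator: A ProxyRotator instance for automatic proxy rotation. Cannot be combined with proxy or proxies.
- headers: Headers to include in the request. These can override any header generated by stealthy_headers.
- max_redirects: Maximum number of redirects. Defaults to 30; pass -1 for unlimited.
- verify: Whether to verify HTTPS certificates. Defaults to True.
- cert: A tuple of client certificate file names, (cert, key).
- selector_config: A dictionary of custom parsing arguments used when creating the final Selector/Response class.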
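To make the "safe" value of follow_redirects more concrete, here is a minimal sketch of the kind of check such a mode implies. It is an illustration only and not Scrapling's implementation: `is_internal_target` and `should_follow` are hypothetical names, and the sketch only inspects literal IP hosts and `localhost` without resolving DNS names.

```python
import ipaddress
from urllib.parse import urlsplit


def is_internal_target(url: str) -> bool:
    """Return True when the URL host is 'localhost' or a literal internal IP.

    Hypothetical helper for illustration; hostnames are not resolved here.
    """
    host = urlsplit(url).hostname or ""
    if host == "localhost":
        return True
    try:
        address = ipaddress.ip_address(host)
    except ValueError:
        # A regular hostname; a real check would resolve it before deciding.
        return False
    return address.is_private or address.is_loopback or address.is_link_local


def should_follow(redirect_url: str, follow_redirects="safe") -> bool:
    """Mirror the three documented modes: True, False, and "safe"."""
    if follow_redirects is True:
        return True
    if follow_redirects is False:
        return False
    # "safe": follow the redirect unless it points at an internal address.
    return not is_internal_target(redirect_url)


print(should_follow("https://example.com/next"))      # True
print(should_follow("http://127.0.0.1/admin"))        # False
print(should_follow("http://127.0.0.1/admin", True))  # True
```

A production check would also have to resolve hostnames, since a public-looking name can point at a private address.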
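Beyond these, if you need deeper customization, you can pass any of these methods additional arguments that curl_cffi supports but the method does not yet expose directly.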
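The retries and retry_delay parameters listed above describe a familiar retry loop. The sketch below shows that pattern with the standard library only. `call_with_retries`, `send`, and `sleep` are hypothetical names, and treating `retries` as the total number of attempts is an assumption, not a statement about how the fetcher counts them.

```python
import time


def call_with_retries(send, retries=3, retry_delay=1.0, sleep=time.sleep):
    """Call send() until it succeeds, pausing retry_delay seconds between attempts.

    Assumption: `retries` is the total number of attempts (at least one).
    """
    attempts = max(1, retries)
    for attempt in range(1, attempts + 1):
        try:
            return send()
        except Exception:  # a real client would catch only network errors
            if attempt == attempts:
                raise  # out of attempts: surface the last error
            sleep(retry_delay)


# Usage with a stand-in request that fails twice, then succeeds.
calls = []
delays = []


def flaky():
    calls.append(1)
    if len(calls) < 3:
        raise ConnectionError("temporary failure")
    return "ok"


# Recording the delays instead of sleeping keeps the example instant.
result = call_with_retries(flaky, retries=3, retry_delay=1.0, sleep=delays.append)
print(result, len(calls), delays)  # ok 3 [1.0, 1.0]
```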
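HTTP methods
Each HTTP method also takes a few extra arguments of its own, such as params for GET requests and data / json for POST / PUT / DELETE requests.
Examples are the best way to explain this, so each method is shown in turn.
Note that the OPTIONS and HEAD methods are currently not supported.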
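As background for the params, data, and json arguments mentioned above, this sketch shows what each one conventionally becomes on the wire: a query string, a form-encoded body, and a JSON body. It uses only the standard library; `build_url`, `encode_form`, and `encode_json` are hypothetical helpers, and the exact encoding curl_cffi applies (for example to list values) may differ.

```python
import json
from urllib.parse import urlencode


def build_url(url: str, params: dict) -> str:
    """Append params to the URL as a query string."""
    separator = "&" if "?" in url else "?"
    return f"{url}{separator}{urlencode(params)}"


def encode_form(data: dict) -> bytes:
    """Encode data as application/x-www-form-urlencoded."""
    # doseq=True expands list values into repeated keys.
    return urlencode(data, doseq=True).encode("utf-8")


def encode_json(payload: dict) -> bytes:
    """Encode payload as a JSON request body."""
    return json.dumps(payload).encode("utf-8")


print(build_url("https://example.com/search", {"q": "query"}))
# https://example.com/search?q=query
print(encode_form({"key": ["value1", "value2"]}))
# b'key=value1&key=value2'
print(encode_json({"key": "value"}))
# b'{"key": "value"}'
```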
GET

    from scrapling.fetchers import Fetcher

    # Basic GET
    page = Fetcher.get('https://example.com')
    page = Fetcher.get('https://scrapling.requestcatcher.com/get', stealthy_headers=True)
    page = Fetcher.get('https://scrapling.requestcatcher.com/get', proxy='http://username:***@localhost:8030')

    # With parameters
    page = Fetcher.get('https://example.com/search', params={'q': 'query'})

    # With headers
    page = Fetcher.get('https://example.com', headers={'User-Agent': 'Custom/1.0'})

    # Basic HTTP authentication
    page = Fetcher.get("https://example.com", auth=("my_user", "password123"))

    # Browser impersonation
    page = Fetcher.get('https://example.com', impersonate='chrome')

    # HTTP/3 support
    page = Fetcher.get('https://example.com', http3=True)

The asynchronous version needs only a small change:

    import asyncio
    from scrapling.fetchers import AsyncFetcher

    async def main():
        # Basic GET
        page = await AsyncFetcher.get('https://example.com')
        page = await AsyncFetcher.get('https://scrapling.requestcatcher.com/get', stealthy_headers=True)
        page = await AsyncFetcher.get('https://scrapling.requestcatcher.com/get', proxy='http://username:***@localhost:8030')

        # With parameters
        page = await AsyncFetcher.get('https://example.com/search', params={'q': 'query'})

        # With headers
        page = await AsyncFetcher.get('https://example.com', headers={'User-Agent': 'Custom/1.0'})

        # Basic HTTP authentication
        page = await AsyncFetcher.get("https://example.com", auth=("my_user", "password123"))

        # Browser impersonation
        page = await AsyncFetcher.get('https://example.com', impersonate='chrome110')

        # HTTP/3 support
        page = await AsyncFetcher.get('https://example.com', http3=True)
        return page

    page = asyncio.run(main())

In every case above, the page object is a Response object, which is itself a Selector, so you can use it directly:

    >>> page.css('.something.something')
    >>> page = Fetcher.get('https://api.github.com/events')
    >>> page.json()
    [{'id': '<redacted>', 'type': 'PushEvent', 'actor': {'id': '<redacted>', 'login': '<redacted>', 'display_login': '<redacted>', 'gravatar_id': '', 'url': 'https://api.github.com/users/<redacted>', 'avatar_url': 'https://avatars.githubusercontent.com/u/<redacted>'}, 'repo': {'id': '<redacted>',...

POST

    from scrapling.fetchers import Fetcher

    # Basic POST
    page = Fetcher.post('https://scrapling.requestcatcher.com/post', data={'key': 'value'}, params={'q': 'query'})
    page = Fetcher.post('https://scrapling.requestcatcher.com/post', data={'key': 'value'}, stealthy_headers=True)
    page = Fetcher.post('https://scrapling.requestcatcher.com/post', data={'key': 'value'}, proxy='http://username:***@localhost:8030', impersonate="chrome")

    # Another example of form-encoded data
    page = Fetcher.post('https://example.com/submit', data={'username': 'user', 'password': 'pass'}, http3=True)

    # JSON data
    page = Fetcher.post('https://example.com/api', json={'key': 'value'})

The asynchronous version again needs only a small change:

    import asyncio
    from scrapling.fetchers import AsyncFetcher

    async def main():
        # Basic POST
        page = await AsyncFetcher.post('https://scrapling.requestcatcher.com/post', data={'key': 'value'})
        page = await AsyncFetcher.post('https://scrapling.requestcatcher.com/post', data={'key': 'value'}, stealthy_headers=True)
        page = await AsyncFetcher.post('https://scrapling.requestcatcher.com/post', data={'key': 'value'}, proxy='http://username:***@localhost:8030', impersonate="chrome")

        # Another example of form-encoded data
        page = await AsyncFetcher.post('https://example.com/submit', data={'username': 'user', 'password': 'pass'}, http3=True)

        # JSON data
        page = await AsyncFetcher.post('https://example.com/api', json={'key': 'value'})

    asyncio.run(main())

PUT

    from scrapling.fetchers import Fetcher

    # Basic PUT
    page = Fetcher.put('https://example.com/update', data={'status': 'updated'})
    page = Fetcher.put('https://example.com/update', data={'status': 'updated'}, stealthy_headers=True, impersonate="chrome")
    page = Fetcher.put('https://example.com/update', data={'status': 'updated'}, proxy='http://username:***@localhost:8030')

    # Another example of form-encoded data
    page = Fetcher.put("https://scrapling.requestcatcher.com/put", data={'key': ['value1', 'value2']})

Asynchronous version:

    import asyncio
    from scrapling.fetchers import AsyncFetcher

    async def main():
        # Basic PUT
        page = await AsyncFetcher.put('https://example.com/update', data={'status': 'updated'})
        page = await AsyncFetcher.put('https://example.com/update', data={'status': 'updated'}, stealthy_headers=True, impersonate="chrome")
        page = await AsyncFetcher.put('https://example.com/update', data={'status': 'updated'}, proxy='http://username:***@localhost:8030')

        # Another example of form-encoded data
        page = await AsyncFetcher.put("https://scrapling.requestcatcher.com/put", data={'key': ['value1', 'value2']})

    asyncio.run(main())

DELETE

    from scrapling.fetchers import Fetcher

    page = Fetcher.delete('https://example.com/resource/123')
    page = Fetcher.delete('https://example.com/resource/123', stealthy_headers=True, impersonate="chrome")
    page = Fetcher.delete('https://example.com/resource/123', proxy='http://username:***@localhost:8030')

Asynchronous version:
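
    import asyncio
    from scrapling.fetchers import AsyncFetcher

    async def main():
        page = await AsyncFetcher.delete('https://example.com/resource/123')
        page = await AsyncFetcher.delete('https://example.com/resource/123', stealthy_headers=True, impersonate="chrome")
        page = await AsyncFetcher.delete('https://example.com/resource/123', proxy='http://username:***@localhost:8030')

    asyncio.run(main())

To make multiple requests with the same configuration, use the FetcherSession class. It works in both synchronous and asynchronous code: the class detects the context and switches session type automatically, so no separate import is needed.
The FetcherSession class accepts almost every argument the request methods accept, so you can set defaults for the whole session and then override them for a single request, as the following example shows.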

    from scrapling.fetchers import FetcherSession

    # Create a session with default configuration
    with FetcherSession(
        impersonate='chrome',
        http3=True,
        stealthy_headers=True,
        timeout=30,
        retries=3
    ) as session:
        # Make multiple requests with the same settings and the same cookies
        page1 = session.get('https://scrapling.requestcatcher.com/get')
        page2 = session.post('https://scrapling.requestcatcher.com/post', data={'key': 'value'})
        page3 = session.get('https://api.github.com/events')

        # All requests share the same session and connection pool

You can also pair a ProxyRotator with FetcherSession to rotate proxies automatically across requests:

    from scrapling.fetchers import FetcherSession, ProxyRotator

    rotator = ProxyRotator([
        'http://proxy1:8080',
        'http://proxy2:8080',
        'http://proxy3:8080',
    ])

    with FetcherSession(proxy_rotator=rotator, impersonate='chrome') as session:
        # Each request automatically uses the next proxy in rotation
        page1 = session.get('https://example.com/page1')
        page2 = session.get('https://example.com/page2')

        # You can check which proxy was used via the response metadata
        print(page1.meta['proxy'])

You can also override the session-level proxy (or rotator) for a single request by passing proxy= directly:

    from scrapling.fetchers import FetcherSession

    with FetcherSession(proxy='http://default-proxy:8080') as session:
        # Uses the session proxy
        page1 = session.get('https://example.com/page1')

        # Override the proxy for this specific request
        page2 = session.get('https://example.com/page2', proxy='http://special-proxy:9090')

Asynchronous example:

    import asyncio
    from scrapling.fetchers import FetcherSession

    async def main():
        async with FetcherSession(impersonate='firefox', http3=True) as session:
            # All standard HTTP methods available
            response = await session.get('https://example.com')
            response = await session.post('https://scrapling.requestcatcher.com/post', json={'data': 'value'})
            response = await session.put('https://scrapling.requestcatcher.com/put', data={'update': 'info'})
            response = await session.delete('https://scrapling.requestcatcher.com/delete')

    asyncio.run(main())

Or, better, run the requests concurrently:

    import asyncio
    from scrapling.fetchers import FetcherSession

    async def main():
        # Async session usage
        async with FetcherSession(impersonate="safari") as session:
            urls = ['https://example.com/page1', 'https://example.com/page2']

            tasks = [session.get(url) for url in urls]

            pages = await asyncio.gather(*tasks)

    asyncio.run(main())

The Fetcher class itself works by creating a temporary FetcherSession each time you make a request.
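The proxy rotation and per-request override from the session examples earlier in this section can be pictured with a small stand-in. This is a sketch that assumes plain round-robin order; `RoundRobinProxies` and `choose_proxy` are hypothetical names and not Scrapling's ProxyRotator implementation.

```python
from itertools import cycle
from typing import Iterable, Optional


class RoundRobinProxies:
    """Hypothetical stand-in that hands out proxies in a fixed cycle."""

    def __init__(self, proxies: Iterable[str]):
        proxies = list(proxies)
        if not proxies:
            raise ValueError("at least one proxy is required")
        self._cycle = cycle(proxies)

    def next_proxy(self) -> str:
        return next(self._cycle)


def choose_proxy(rotator: RoundRobinProxies, override: Optional[str] = None) -> str:
    """Pick the proxy for one request; an explicit override wins over the rotator."""
    return override if override is not None else rotator.next_proxy()


rotator = RoundRobinProxies([
    "http://proxy1:8080",
    "http://proxy2:8080",
    "http://proxy3:8080",
])

# Four requests without an override wrap around to the first proxy again.
used = [choose_proxy(rotator) for _ in range(4)]
print(used)

# A per-request override bypasses the rotation for that request only.
special = choose_proxy(rotator, override="http://special-proxy:9090")
print(special)  # http://special-proxy:9090
```

In this sketch an overridden request does not advance the rotation; whether a real rotator behaves the same way is not something the text above specifies.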
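Session benefits
- Much faster: 10 times faster than creating a new session for every request.
- Cookie persistence: cookies are handled automatically across requests.
- Better resource efficiency: lower memory and CPU usage across multiple requests.
- Centralized configuration: request settings are managed in one place.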
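The centralized-configuration point amounts to a simple rule: session defaults apply to every request unless a request passes its own value. A minimal sketch of that rule with plain dictionaries follows. `merge_request_options` is a hypothetical helper used only for illustration, not part of Scrapling.

```python
def merge_request_options(session_defaults: dict, **overrides) -> dict:
    """Return the options for one request: session defaults updated by overrides."""
    options = dict(session_defaults)  # copy so the session defaults stay untouched
    options.update(overrides)
    return options


defaults = {"impersonate": "chrome", "timeout": 30, "retries": 3}

# A request without overrides inherits every session default.
first = merge_request_options(defaults)

# A request with overrides changes only the keys it names.
second = merge_request_options(defaults, timeout=5, impersonate="firefox")

print(first)   # {'impersonate': 'chrome', 'timeout': 30, 'retries': 3}
print(second)  # {'impersonate': 'firefox', 'timeout': 5, 'retries': 3}
```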
Below are some more complete examples to help newcomers to web scraping.
Basic HTTP request

    from scrapling.fetchers import Fetcher

    # Make a request
    page = Fetcher.get('https://example.com')

    # Check the status
    if page.status == 200:
        # Extract title
        title = page.css('title::text').get()
        print(f"Page title: {title}")

        # Extract all links
        links = page.css('a::attr(href)').getall()
        print(f"Found {len(links)} links")

Data extraction

    from scrapling.fetchers import Fetcher

    def scrape_products():
        page = Fetcher.get('https://example.com/products')

        # Find all product elements
        products = page.css('.product')

        results = []
        for product in products:
            results.append({
                'title': product.css('.title::text').get(),
                'price': product.css('.price::text').re_first(r'\d+\.\d{2}'),
                'description': product.css('.description::text').get(),
                'in_stock': product.has_class('in-stock')
            })

        return results

Downloading a file

    from scrapling.fetchers import Fetcher

    page = Fetcher.get('https://raw.githubusercontent.com/D4Vinci/Scrapling/main/docs/assets/main_cover.png')
    with open(file='main_cover.png', mode='wb') as f:
        f.write(page.body)

Pagination

    from scrapling.fetchers import Fetcher

    def scrape_all_pages():
        base_url = 'https://example.com/products?page={}'
        page_num = 1
        all_products = []

        while True:
            # Get current page
            page = Fetcher.get(base_url.format(page_num))

            # Find products; stop when a page has none
            products = page.css('.product')
            if not products:
                break

            # Process products
            for product in products:
                all_products.append({
                    'name': product.css('.name::text').get(),
                    'price': product.css('.price::text').get()
                })

            # Next page
            page_num += 1

        return all_products

Form submission

    from scrapling.fetchers import Fetcher

    # Submit login form
    response = Fetcher.post(
        'https://example.com/login',
        data={
            'username': 'user@example.com',
            'password': 'password123'
        }
    )

    # Check login success
    if response.status == 200:
        # Extract user info
        user_name = response.css('.user-name::text').get()
        print(f"Logged in as: {user_name}")

Table extraction

    from scrapling.fetchers import Fetcher

    def extract_table():
        page = Fetcher.get('https://example.com/data')

        # Find table
        table = page.css('table')[0]

        # Extract headers
        headers = [th.text for th in table.css('thead th')]

        # Extract rows, pairing each cell with its column header
        rows = []
        for row in table.css('tbody tr'):
            cells = [td.text for td in row.css('td')]
            rows.append(dict(zip(headers, cells)))

        return rows

Navigation menu

    from scrapling.fetchers import Fetcher

    def extract_menu():
        page = Fetcher.get('https://example.com')

        # Find navigation
        nav = page.css('nav')[0]

        menu = {}
        for item in nav.css('li'):
            links = item.css('a')
            if links:
                link = links[0]
                menu[link.text] = {
                    'url': link['href'],
                    'has_submenu': bool(item.css('.submenu'))
                }

        return menu

Use Fetcher when:
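- You need fast HTTP requests.
- You want minimal overhead.
- You do not need JavaScript execution (the target site can be scraped with plain requests).
- You need some stealth features (for example, the target site has protections but does not use JavaScript challenges).
Use FetcherSession when:
- You make multiple requests to the same site or to different sites.
- You need to keep cookies / authentication state across requests.
- You want connection pooling for better performance.
- You need consistent configuration across requests.
- You work with APIs that depend on session state.
Use other fetchers when:
- You need browser automation.
- You need more advanced anti-bot / stealth capabilities.
- You need JavaScript support or need to interact with dynamic content.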