Amazon Category Top 100 Data Collection Guide 2026: Bypassing Anti-Scraping Defenses + a Complete Python Engineering Solution

Latest recommended article published on 2026-06-27 16:33:05


This content is licensed under CC 4.0 BY-SA.



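Preface

Amazon Best Sellers pages are publicly accessible, but collecting category Top 100 data reliably and in bulk is far more complex from an engineering standpoint than most people expect. This article breaks down the collection challenges from a technical angle, provides a complete, runnable Python engineering solution, and compares how mainstream approaches perform in real production environments.

Intended audience: engineers or data analysts with basic Python skills who need to build an Amazon data collection pipeline.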

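1. Structure of Amazon Category Bestseller Data

The DOM of an Amazon Best Sellers page (the /zgbs/ path) has two levels:

Outer container: #zg-ordered-list, which holds 100 product cards
Product card: li.zg-item-immersion; each card contains the rank, ASIN (extracted from the product link), title, image, price, rating, review count, and so on

Key field extraction paths (based on the page structure at the time of writing; they may change as Amazon updates the page):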

# Typical field extraction (for illustration only; an API-based approach is recommended in production)
# `card` is one li.zg-item-immersion element from a BeautifulSoup-parsed page
rank = card.select_one('.zg-bdg-text').text.strip().lstrip('#')
title = card.select_one('.p13n-sc-truncated').get('title', '')
asin = card.select_one('a.a-link-normal')['href'].split('/dp/')[1].split('/')[0]
# Look up optional elements once and guard against missing nodes
price_el = card.select_one('.p13n-sc-price')
price = price_el.text if price_el else None
rating_el = card.select_one('i.a-icon-star span')
rating = rating_el.text.split()[0] if rating_el else None

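⚠️ Note: the selectors above reflect the page structure as of 2024. Amazon updated the Best Sellers page structure at least 11 times during 2024, and 3 of those updates broke the selectors completely. Hardcoding selectors in production is extremely risky.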
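
One way to reduce dependence on exact DOM paths is to pull the ASIN out of the product link with a regular expression instead of splitting on a fixed path segment. The sketch below is illustrative only: the helper name `extract_asin` and the assumption that an ASIN is 10 uppercase alphanumeric characters following `/dp/` or `/gp/product/` are not taken from the page structure described above.

```python
import re
from typing import Optional

# Assumption: an ASIN is 10 uppercase letters/digits following /dp/ or /gp/product/.
_ASIN_RE = re.compile(r"/(?:dp|gp/product)/([A-Z0-9]{10})(?:[/?]|$)")

def extract_asin(href: str) -> Optional[str]:
    """Return the ASIN embedded in a product link, or None if nothing matches."""
    match = _ASIN_RE.search(href or "")
    return match.group(1) if match else None

# Hypothetical links used only to exercise the helper
print(extract_asin("/Some-Product-Name/dp/B08N5WRWNW/ref=zg_bs_1"))
print(extract_asin("/gp/help/customer/display.html"))
```

Returning None rather than raising lets the caller skip a malformed card without aborting the whole page.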


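2. In-Depth Analysis of Anti-Scraping Mechanisms

2.1 Four-Layer Defense System

Layer 1: IP rate limiting
  ├── Threshold: about 30 requests/minute/IP (unofficial, inferred from testing)
  ├── Response: returns a blank page / redirects to the home page
  └── Bypass: proxy pool rotation + request interval control

Layer 2: TLS fingerprint detection
  ├── Checks: TLS version, cipher suite order, extension fields
  ├── The default Python requests fingerprint differs noticeably from Chrome's
  └── Bypass: use curl-cffi or tls-client to mimic a browser's TLS handshake

Layer 3: Behavioral analysis
  ├── Cookie chain continuity (whether there is a complete session history)
  ├── Regularity of request intervals (bot requests tend to be too evenly spaced)
  └── Bypass: randomized request intervals + full cookie management

Layer 4: Dynamic CAPTCHA injection
  ├── Trigger: a CAPTCHA challenge is injected once the combined risk score exceeds a threshold
  ├── Trigger rate with ordinary proxy pools: 60-80% (high-frequency scenarios)
  └── No perfect bypass (third-party CAPTCHA services are costly and unreliable)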
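
The request-interval control mentioned for Layers 1 and 3 can be sketched as a generator of randomized delays. This is a minimal illustration, not a tested evasion technique: the function name `jittered_delays`, the default of 20 requests per minute (chosen to sit below the roughly 30/minute threshold quoted above), and the jitter factor are all assumptions.

```python
import random
from typing import Iterator, Optional

def jittered_delays(max_per_minute: int = 20, jitter: float = 0.5,
                    seed: Optional[int] = None) -> Iterator[float]:
    """Yield an endless stream of randomized delays in seconds.

    Every delay is at least 60 / max_per_minute, so the request rate never
    exceeds max_per_minute; the random factor in [1, 1 + jitter] keeps the
    intervals from being uniformly spaced.
    """
    base = 60.0 / max_per_minute
    rng = random.Random(seed)
    while True:
        yield base * rng.uniform(1.0, 1.0 + jitter)

delays = jittered_delays(max_per_minute=20, seed=42)
for _ in range(3):
    pause = next(delays)
    print(f"would sleep {pause:.2f}s")  # call time.sleep(pause) in a real loop
```

With a proxy pool, one generator per proxy keeps the per-IP rate bounded independently.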

2.2 TLS Fingerprint Bypass Example

# Use curl-cffi to mimic a real browser's TLS fingerprint
from curl_cffi import requests as cffi_requests

# impersonate also sends matching default browser headers (including User-Agent),
# so only headers that should differ from those defaults are set here
session = cffi_requests.Session(impersonate="chrome110")
response = session.get(
    "https://www.amazon.com/Best-Sellers-Electronics/zgbs/electronics/",
    headers={
        "Accept-Language": "en-US,en;q=0.9",
    },
    timeout=30,
)

Limitations: even with the TLS fingerprint problem solved, the Layer 3 and Layer 4 countermeasures remain, and Amazon keeps updating its detection rules.
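
Because Layers 1 and 4 can still respond with blank pages or challenges, a crawler usually needs to check whether a response is real content before parsing it. The heuristic below is a sketch under stated assumptions: the marker strings, the status codes, and the 2000-character minimum are illustrative guesses that should be verified against real responses, and `looks_blocked` is a hypothetical helper name.

```python
# Assumption: these markers are illustrative guesses at what a challenge
# or block page might contain; verify them against real responses.
BLOCK_MARKERS = (
    "captcha",
    "enter the characters you see below",
    "automated access",
)

def looks_blocked(status_code: int, body: str, min_length: int = 2000) -> bool:
    """Heuristically decide whether a response is a block or challenge page."""
    if status_code in (403, 429, 503):
        return True
    text = body.lower()
    if any(marker in text for marker in BLOCK_MARKERS):
        return True
    # Layer 1 is described as returning blank pages, so very short bodies are suspicious.
    return len(text.strip()) < min_length

print(looks_blocked(200, ""))                   # blank page
print(looks_blocked(200, "product " * 1000))    # long body with no markers
```

A positive result is a signal to rotate the proxy and back off rather than retry immediately.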

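3. Performance Comparison of Mainstream Approaches

Dimension	Custom crawler	Selenium/Playwright	SaaS tool API	Pangolinfo Scrape API
Initial integration time	3–5 days	2–3 days	1–2 days	30 minutes
Maintenance cost per month	High ($3,000+ in engineer time)	High	Low	Very low
CAPTCHA handling	No built-in solution	Partial mitigation	N/A	Built in
Adapting to page structure changes	Manual fix (6–48 h outage)	Manual fix	Handled by vendor	Automatic (2–4 h)
Data freshness	Real time (if not blocked)	Real time (if not blocked)	24–72 h delay	Real time (5–15 s)
Monthly cost (200 categories/day)	$18,500+	$25,000+	$279+	$120–300
A/B test page handling	❌	❌	N/A	✅
Customer Says field	Unstable	Unstable	Usually absent	✅
SP ad slot capture rate	60–70%	65–75%	N/A	98%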
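
To put the monthly cost row on a per-fetch basis, the figures can be divided by the number of category fetches per month. The arithmetic below assumes a 30-day month and exactly one request per category per day (no pagination or retries), and it treats the "+" figures as lower bounds; none of these assumptions come from the table itself.

```python
CATEGORIES_PER_DAY = 200
DAYS_PER_MONTH = 30  # assumption

monthly_requests = CATEGORIES_PER_DAY * DAYS_PER_MONTH

# Monthly cost figures copied from the comparison table (USD)
monthly_cost_usd = {
    "Custom crawler": 18_500,
    "Selenium/Playwright": 25_000,
    "SaaS tool API": 279,
    "Pangolinfo Scrape API (low)": 120,
    "Pangolinfo Scrape API (high)": 300,
}

def cost_per_request(monthly_cost: float, requests_per_month: int) -> float:
    """Return the cost of a single category fetch, rounded to 4 decimals."""
    return round(monthly_cost / requests_per_month, 4)

for name, cost in monthly_cost_usd.items():
    print(f"{name}: ${cost_per_request(cost, monthly_requests)} per category fetch")
```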

4. Complete Engineering Implementation Based on the Pangolinfo API

4.1 Environment Setup

pip install requests pandas schedule loguru

4.2 Core Collection Module

"""
collector.py
Amazon category Top 100 collection module based on the Pangolinfo Scrape API
"""

import requests
import time
import random
from datetime import datetime
from typing import Optional
from loguru import logger
from concurrent.futures import ThreadPoolExecutor, as_completed

class AmazonTop100Collector:
    """Collector for Amazon category Top 100 data"""
    
    API_ENDPOINT = "https://api.pangolinfo.com/scrape"
    
    def __init__(self, api_key: str, max_workers: int = 5):
        self.api_key = api_key
        self.max_workers = max_workers
        self.session = requests.Session()
        self.session.headers.update({
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json"
        })
        logger.info(f"AmazonTop100Collector initialized | max_workers={max_workers}")
    
    def fetch_category(
        self, 
        category_url: str,
        marketplace: str = "US",
        output_format: str = "json",
        retry_times: int = 3
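    ) -> Optional[dict]:
        """
        Collect Top 100 data for a single category
        
        Args:
            category_url: Amazon category Best Sellers URL
            marketplace: marketplace code (US/UK/DE/JP/CA/FR/IT/ES)
            output_format: output format (json/markdown/html)
            retry_times: total number of attempts before giving up
        
        Returns:
            A dict containing a products list, each product carrying the full
            set of fields; None if every attempt fails
        """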
        payload = {
            "url": category_url,
            "marketplace": marketplace,
            "output_format": output_format,
            "parse_template": "amazon_bestsellers",
            "include_fields": [
                "rank", "asin", "title", "price", "original_price",
                "rating", "review_count", "brand", "is_prime",
                "badge", "fulfillment_type", "variant_count",
                "subcategory_path", "image_url", "customer_says",
                "sp_ad_position"
            ]
        }
        
        for attempt in range(retry_times):
            try:
                resp = self.session.post(
                    self.API_ENDPOINT,
                    json=payload,
                    timeout=30
                )
                resp.raise_for_status()
                data = resp.json()
                
                # Inject metadata
                ts = datetime.utcnow().isoformat()
                for item in data.get("products", []):
                    item["_scraped_at"] = ts
                    item["_category_url"] = category_url
                    item["_marketplace"] = marketplace
                
                product_count = len(data.get("products", []))
                logger.success(f"[{marketplace}] {category_url} → {product_count} products")
                return data
                
            except requests.exceptions.Timeout:
                logger.warning(f"Attempt {attempt+1}/{retry_times} timeout: {category_url}")
            except requests.exceptions.HTTPError as e:
                logger.error(f"HTTP {e.response.status_code}: {category_url}")
                if e.response.status_code in (401, 403):
                    break  # API key problem; do not retry
            except Exception as e:
                logger.error(f"Unexpected error: {e}")
            
            if attempt < retry_times - 1:
                wait = (attempt + 1) * 2 + random.uniform(0, 1)
                time.sleep(wait)
        
        return None
    
    def fetch_multiple_categories(
        self,
        category_configs: list[dict]
    ) -> list[dict]:
        """
        Collect multiple categories concurrently
        
        Args:
            category_configs: list of category configurations
            Format: [{"url": "...", "marketplace": "US"}, ...]
        
        Returns:
            A flat list of all product records
        """
        all_products = []
        failed_categories = []
        
        with ThreadPoolExecutor(max_workers=self.max_workers) as executor:
            future_map = {
                executor.submit(
                    self.fetch_category,
                    config["url"],
                    config.get("marketplace", "US")
                ): config
                for config in category_configs
            }
            
            for future in as_completed(future_map):
                config = future_map[future]
                try:
                    result = future.result()
                    if result and result.get("products"):
                        all_products.extend(result["products"])
                    else:
                        failed_categories.append(config["url"])
                except Exception as e:
                    logger.error(f"Future error for {config['url']}: {e}")
                    failed_categories.append(config["url"])
        
        if failed_categories:
            logger.warning(f"{len(failed_categories)} categories failed: {failed_categories}")
        
        logger.info(f"Total collected: {len(all_products)} products across "
                   f"{len(category_configs) - len(failed_categories)} categories")
        return all_products
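
The retry delay in fetch_category grows linearly with the attempt number and adds up to one second of random jitter. The sketch below isolates that formula so the schedule is easy to inspect, and places it next to a capped exponential variant with full jitter. The exponential version is a hypothetical alternative, not part of the collector above, and both function names are introduced here for illustration.

```python
import random
from typing import Optional

def linear_backoff(attempt: int, rng: Optional[random.Random] = None) -> float:
    """Delay formula from fetch_category: 2s, 4s, 6s, ... plus up to 1s of jitter."""
    rng = rng or random
    return (attempt + 1) * 2 + rng.uniform(0, 1)

def exponential_backoff(attempt: int, base: float = 2.0, cap: float = 60.0,
                        rng: Optional[random.Random] = None) -> float:
    """Hypothetical alternative: capped exponential backoff with full jitter."""
    rng = rng or random
    return rng.uniform(0, min(cap, base * (2 ** attempt)))

for attempt in range(3):
    print(attempt,
          round(linear_backoff(attempt), 2),
          round(exponential_backoff(attempt), 2))
```

Linear growth is predictable for a small retry budget; the exponential form spreads concurrent workers further apart when many requests fail at once.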

4.3 Data Persistence Module

"""
storage.py
Data storage module - SQLite (extensible to PostgreSQL / BigQuery)
"""

import sqlite3
import pandas as pd
from pathlib import Path
from loguru import logger

class Top100Storage:
    
    CREATE_TABLE_SQL = """
        CREATE TABLE IF NOT EXISTS amazon_top100 (
            id INTEGER PRIMARY KEY AUTOINCREMENT,
            scraped_at TEXT NOT NULL,
            marketplace TEXT NOT NULL,
            category_url TEXT NOT NULL,
            rank INTEGER NOT NULL,
            asin TEXT NOT NULL,
            title TEXT,
            price REAL,
            original_price REAL,
            rating REAL,
            review_count INTEGER,
            brand TEXT,
            is_prime INTEGER DEFAULT 0,
            badge TEXT,
            fulfillment_type TEXT,
            variant_count INTEGER,
            subcategory_path TEXT,
            image_url TEXT,
            customer_says TEXT,
            sp_ad_position INTEGER,
            UNIQUE(scraped_at, asin, category_url, rank)
        )
    """
    
    CREATE_INDEX_SQLS = [
        "CREATE INDEX IF NOT EXISTS idx_asin ON amazon_top100(asin)",
        "CREATE INDEX IF NOT EXISTS idx_scraped_at ON amazon_top100(scraped_at)",
        "CREATE INDEX IF NOT EXISTS idx_category ON amazon_top100(category_url)",
        "CREATE INDEX IF NOT EXISTS idx_marketplace ON amazon_top100(marketplace)",
        "CREATE INDEX IF NOT EXISTS idx_rank ON amazon_top100(rank)",
    ]
    
    def __init__(self, db_path: str = "amazon_top100.db"):
        self.db_path = db_path
        self._init_db()
    
    def _init_db(self):
        with sqlite3.connect(self.db_path) as conn:
            conn.execute(self.CREATE_TABLE_SQL)
            for idx_sql in self.CREATE_INDEX_SQLS:
                conn.execute(idx_sql)
            conn.commit()
        logger.info(f"Database initialized: {self.db_path}")
    
    def upsert_products(self, products: list[dict]) -> int:
        """Bulk-insert product rows; duplicate records are skipped automatically"""
        saved = 0
        with sqlite3.connect(self.db_path) as conn:
            for p in products:
                try:
                    cur = conn.execute("""
                        INSERT OR IGNORE INTO amazon_top100
                        (scraped_at, marketplace, category_url, rank, asin, title,
                         price, original_price, rating, review_count, brand, is_prime,
                         badge, fulfillment_type, variant_count, subcategory_path,
                         image_url, customer_says, sp_ad_position)
                        VALUES (?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?)
                    """, (
                        p.get("_scraped_at"), p.get("_marketplace"),
                        p.get("_category_url"), p.get("rank"), p.get("asin"),
                        p.get("title"), p.get("price"), p.get("original_price"),
                        p.get("rating"), p.get("review_count"), p.get("brand"),
                        1 if p.get("is_prime") else 0, p.get("badge"),
                        p.get("fulfillment_type"), p.get("variant_count"),
                        p.get("subcategory_path"), p.get("image_url"),
                        p.get("customer_says"), p.get("sp_ad_position")
                    ))
                    # rowcount is 0 when INSERT OR IGNORE skips a row, so only real inserts count
                    saved += cur.rowcount
                except Exception as e:
                    logger.error(f"Insert error for ASIN {p.get('asin')}: {e}")
            conn.commit()
        logger.info(f"Saved {saved}/{len(products)} records to DB")
        return saved
    
    def get_rank_changes(self, days: int = 7, min_improvement: int = 10) -> pd.DataFrame:
        """Analyze rank movement to surface fast-moving products.

        Note: MAX(rank) - MIN(rank) measures how far the rank moved within the
        window; it does not indicate whether the product rose or fell.
        """
        # Bound parameters replace f-string interpolation in the SQL text
        query = """
            SELECT 
                asin, title, brand,
                category_url,
                MAX(rank) as rank_worst,
                MIN(rank) as rank_best,
                MAX(rank) - MIN(rank) as rank_improvement,
                AVG(price) as avg_price,
                MAX(review_count) as max_reviews
            FROM amazon_top100
            WHERE scraped_at >= datetime('now', ?)
            GROUP BY asin, category_url
            HAVING rank_improvement >= ?
            ORDER BY rank_improvement DESC
        """
        with sqlite3.connect(self.db_path) as conn:
            return pd.read_sql_query(
                query, conn, params=(f"-{int(days)} days", int(min_improvement))
            )
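
The grouped query in get_rank_changes compares the highest and lowest rank numbers in the window, so a product that fell from rank 10 to rank 70 scores the same as one that climbed from 80 to 20. A direction-aware variant can compare the first and last observations instead. The sketch below is one possible approach under simplifying assumptions: it uses an in-memory table with only four columns, made-up sample rows, no time-window filter, and a hypothetical `rank_gain` column.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE amazon_top100 "
    "(scraped_at TEXT, asin TEXT, category_url TEXT, rank INTEGER)"
)
# Made-up sample rows: one product rising, one falling
conn.executemany(
    "INSERT INTO amazon_top100 VALUES (?, ?, ?, ?)",
    [
        ("2026-01-01T00:00:00", "RISER00001", "cat", 80),
        ("2026-01-03T00:00:00", "RISER00001", "cat", 20),
        ("2026-01-01T00:00:00", "FALLER0001", "cat", 10),
        ("2026-01-03T00:00:00", "FALLER0001", "cat", 70),
    ],
)

# Join each product's earliest row (f) to its latest row (l);
# a positive rank_gain means the rank number got smaller, i.e. the product rose.
QUERY = """
    SELECT f.asin, f.category_url,
           f.rank AS rank_first,
           l.rank AS rank_last,
           f.rank - l.rank AS rank_gain
    FROM amazon_top100 AS f
    JOIN amazon_top100 AS l
      ON l.asin = f.asin AND l.category_url = f.category_url
    WHERE f.scraped_at = (SELECT MIN(scraped_at) FROM amazon_top100
                          WHERE asin = f.asin AND category_url = f.category_url)
      AND l.scraped_at = (SELECT MAX(scraped_at) FROM amazon_top100
                          WHERE asin = l.asin AND category_url = l.category_url)
      AND f.rank - l.rank >= ?
    ORDER BY rank_gain DESC
"""
rows = conn.execute(QUERY, (10,)).fetchall()
print(rows)
```

Only the rising product passes the threshold here, because the falling product's rank_gain is negative.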

4.4 Main Program Entry Point

"""
main.py
Main program for Amazon category Top 100 data collection
"""

import schedule
import time
from loguru import logger
from collector import AmazonTop100Collector
from storage import Top100Storage

# Configuration
API_KEY = "your_pangolinfo_api_key"  # obtain from https://tool.pangolinfo.com

CATEGORIES = [
    {"url": "https://www.amazon.com/Best-Sellers-Electronics/zgbs/electronics/", "marketplace": "US"},
    {"url": "https://www.amazon.com/Best-Sellers-Home-Kitchen/zgbs/kitchen/", "marketplace": "US"},
    {"url": "https://www.amazon.com/Best-Sellers-Sports-Outdoors/zgbs/sporting-goods/", "marketplace": "US"},
    {"url": "https://www.amazon.com/Best-Sellers-Toys-Games/zgbs/toys-and-games/", "marketplace": "US"},
    {"url": "https://www.amazon.com/Best-Sellers-Beauty/zgbs/beauty/", "marketplace": "US"},
    {"url": "https://www.amazon.co.uk/Best-Sellers-Electronics/zgbs/electronics/", "marketplace": "UK"},
    # Extend to hundreds of categories as needed
]

collector = AmazonTop100Collector(api_key=API_KEY, max_workers=5)
storage = Top100Storage(db_path="amazon_top100.db")

def run_collection():
    logger.info("=== Starting collection job ===")
    products = collector.fetch_multiple_categories(CATEGORIES)
    storage.upsert_products(products)
    
    # Analyze rank changes
    changes = storage.get_rank_changes(days=7, min_improvement=15)
    if not changes.empty:
        logger.info(f"\n=== Top Rising Products (7d) ===\n{changes.head(10).to_string()}")

# Collect once every 8 hours
schedule.every(8).hours.do(run_collection)

if __name__ == "__main__":
    run_collection()  # run once immediately
    while True:
        schedule.run_pending()
        time.sleep(60)
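
The main program stores the API key as a string literal in source code. A common alternative is to read it from the environment so the key stays out of version control. The sketch below is a suggestion rather than part of the program above; the environment variable name `PANGOLINFO_API_KEY` and the helper `load_api_key` are hypothetical.

```python
import os

def load_api_key(env_var: str = "PANGOLINFO_API_KEY") -> str:
    """Read the API key from the environment and fail fast if it is missing."""
    key = os.environ.get(env_var, "").strip()
    if not key:
        raise RuntimeError(f"Set the {env_var} environment variable before running")
    return key

# In main.py this could replace the hardcoded value:
# API_KEY = load_api_key()
```

Failing at startup avoids running a scheduled job that would only collect 401 responses.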