E-commerce Price Monitoring in Practice: Multi-Shop Price Comparison with Dynamic Proxy IPs

Latest recommended article published on 2026-06-28 15:59:07

This content is licensed under CC 4.0 BY-SA.


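Tech stack: Python 3 + requests + BeautifulSoup
What this article covers:

Building a multi-shop price monitoring script

Supporting multiple e-commerce sites through configuration

Routing requests through a dynamic proxy IP gateway to reduce blocking and improve the success rate

⚠ Disclaimer: The examples in this article are for teaching and demonstration only. When collecting data in practice, strictly follow the target site's robots.txt, its terms of service, and local laws and regulations. Do not collect private data or paid content, and do not engage in any illegal activity.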

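I. Why does e-commerce price monitoring need dynamic proxy IPs?

In real workloads, if you hammer an e-commerce site from a single server with one fixed IP, you will soon run into:

Slow responses and frequent timeouts
CAPTCHA pages, 302 redirects, and 403 Forbidden responses
Risk controls triggered by logging in to multiple accounts from the same IP

The reason is simple:

The site has flagged your IP as "high risk".

A dynamic proxy IP gateway helps in these ways:

Automatic IP rotation: the exit IP changes on every request or after a set interval
Less load per IP: requests are spread across a large IP pool
Multi-region exits: you can choose the country or region that fits your use case

For a long-running, continuous task such as multi-shop price monitoring, dynamic proxies are close to standard equipment.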

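II. Project goals and overall approach

We will build a simple but complete price comparison script that:

Supports the product search pages of multiple shops/platforms
Takes one keyword as input (for example, a specific phone model)
Scrapes the product name, price, shop name, and link from each shop
Saves the results to CSV for later analysis in Excel or a BI tool
Sends every HTTP request through the dynamic proxy gateway

The overall flow is:

Define the shop configuration (URL template + CSS selectors)
Write a generic fetch_html() function (with proxy support and retries)
Write the parsing logic (a single config-driven parse_shop() in the code below)
Aggregate, save, and output the results from all shops

To avoid targeting any real website, the examples use placeholder domains such as https://shop-a.example.com. In practice, replace them with target URLs that your business is permitted to access.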

III. Environment setup

1. Python version

Python 3.8 or later is recommended:

python --version

2. Install dependencies

pip install requests beautifulsoup4 lxml

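IV. Project structure (a config-driven approach)

First, gather the differences between shops into a single configuration list so that it is easy to extend later:

# shops_config.py (this can also live directly in the main script)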
SHOPS = [
    {
        "name": "ShopA",
        "search_url": "https://shop-a.example.com/search?q={keyword}",
        "item_selector": ".product-item",
        "title_selector": ".product-title",
        "price_selector": ".product-price",
        "link_selector": "a.product-link",
        "price_cleaner": lambda x: x.replace("¥", "").replace(",", "").strip()
    },
    {
        "name": "ShopB",
        "search_url": "https://shop-b.example.com/s?k={keyword}",
        "item_selector": "div.result-item",
        "title_selector": "h2.item-title",
        "price_selector": "span.item-price",
        "link_selector": "a.item-link",
        "price_cleaner": lambda x: x.replace("￥", "").replace(",", "").strip()
    },
    # To add a new shop later, append another entry with the same structure
]

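The CSS selectors here are only examples. In a real project, use your browser's developer tools (F12) to inspect the DOM and find the right selectors yourself.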
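
Because every shop is just a dictionary, a typo in one key only shows up later as a KeyError inside the parser. A minimal sketch of an up-front check is shown here; validate_shop() and REQUIRED_KEYS are hypothetical names that are not part of the original script, and the key list simply mirrors the two example entries above.

```python
# Hypothetical helper, not part of the original script: checks one SHOPS entry
# before any request is made.
REQUIRED_KEYS = (
    "name",
    "search_url",
    "item_selector",
    "title_selector",
    "price_selector",
    "link_selector",
    "price_cleaner",
)


def validate_shop(conf: dict) -> list:
    """Return a list of problems; an empty list means the entry looks usable."""
    problems = [f"missing key: {key}" for key in REQUIRED_KEYS if key not in conf]

    # The URL template must contain the placeholder that str.format() fills in.
    if "{keyword}" not in conf.get("search_url", ""):
        problems.append("search_url has no {keyword} placeholder")

    # price_cleaner is called like a function, so it has to be callable.
    cleaner = conf.get("price_cleaner")
    if cleaner is not None and not callable(cleaner):
        problems.append("price_cleaner is not callable")

    return problems
```

One way to use it is to loop over SHOPS at startup and stop with a clear message if any entry returns a non-empty list.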

V. Basic version: scraping a single shop

Start with functions that scrape a single shop, without a proxy for now:

import requests
from bs4 import BeautifulSoup

HEADERS = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"
}

def fetch_html(url: str, proxies=None, timeout: int = 15) -> str:
    resp = requests.get(url, headers=HEADERS, proxies=proxies, timeout=timeout)
    resp.raise_for_status()
    return resp.text


def parse_shop(shop_conf, html_text: str):
    soup = BeautifulSoup(html_text, "lxml")
    items = soup.select(shop_conf["item_selector"])
    results = []

    for it in items:
        title_el = it.select_one(shop_conf["title_selector"])
        price_el = it.select_one(shop_conf["price_selector"])
        link_el = it.select_one(shop_conf["link_selector"])

        if not (title_el and price_el and link_el):
            continue

        title = title_el.get_text(strip=True)
        price_raw = price_el.get_text(strip=True)
        price = shop_conf["price_cleaner"](price_raw)
        link = link_el.get("href")

        # Some sites return relative links; prepend the scheme and host in a simple way
        if link and link.startswith("/"):
            # Take the scheme and host from search_url
            base = shop_conf["search_url"].split("/", 3)
            if len(base) >= 3:
                link = base[0] + "//" + base[2] + link

        results.append({
            "shop": shop_conf["name"],
            "title": title,
            "price": price,
            "url": link
        })

    return results
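
The domain patch in parse_shop() covers links that start with a single "/". As a hedged alternative, the standard library's urllib.parse.urljoin resolves any kind of href against the page URL, including protocol-relative links ("//host/path") and links without a leading slash. The helper name absolutize() is hypothetical and the URLs are the article's placeholder domains.

```python
from urllib.parse import urljoin


def absolutize(page_url: str, href: str) -> str:
    """Resolve href against the URL of the page it was found on.

    Hypothetical helper: a possible drop-in for the manual split("/", 3)
    logic, not the original author's code.
    """
    return urljoin(page_url, href)


if __name__ == "__main__":
    page = "https://shop-a.example.com/search?q=phone"
    for href in ("/item/1", "//cdn.example.com/p/2", "item/3"):
        print(href, "->", absolutize(page, href))
```

Absolute links are returned unchanged, so the helper can be applied to every href without a startswith() check.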

Then add a simple test entry point:

from shops_config import SHOPS

def test_single_shop():
    keyword = "iphone 15"
    shop = SHOPS[0]  # test ShopA first
    url = shop["search_url"].format(keyword=keyword)
    html_text = fetch_html(url)
    data = parse_shop(shop, html_text)
    for row in data[:5]:
        print(row)

if __name__ == "__main__":
    test_single_shop()

At this point the script can search one shop for a keyword and parse out multiple products.
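
The test above formats the raw keyword straight into the URL template. Keywords containing characters such as "&", "#", or non-ASCII text can change the meaning of the query string, so a small sketch of percent-encoding the keyword first may help. build_search_url() is a hypothetical helper name, and it assumes the {keyword} placeholder sits in the query string, as in both example shops.

```python
from urllib.parse import quote_plus


def build_search_url(template: str, keyword: str) -> str:
    """Fill the {keyword} placeholder with a query-string-safe keyword.

    Hypothetical helper, not part of the original script.
    quote_plus() encodes spaces as "+" and escapes reserved characters.
    """
    return template.format(keyword=quote_plus(keyword))


if __name__ == "__main__":
    template = "https://shop-a.example.com/search?q={keyword}"
    print(build_search_url(template, "iphone 15"))
```

If a target site expects "%20" rather than "+" for spaces, urllib.parse.quote can be swapped in instead.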

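VI. Connecting to the dynamic proxy IP gateway (the core part)

This is the most important part of the article: routing every request through the dynamic proxy.

Suppose your proxy provider gives you access details in this form:

Gateway address: gw.yourproxy.com
Port: 8000
Username: your_user
Password: your_pass

We wrap these in a get_proxies() function that the fetch_html() above can use directly: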

# proxy_config.py
USE_PROXY = True          # whether to use a proxy at all
USE_DYNAMIC = True        # this article mainly demonstrates dynamic proxies (not read by get_proxies() below)

DYNAMIC_PROXY = {
    "host": "gw.yourproxy.com",
    "port": 8000,
    "username": "your_user",
    "password": "your_pass",
    # Some providers accept country/region/ISP parameters inside the username; they can be written here directly
    # e.g. "username": "your_user-country=us"
}

def get_proxies():
    if not USE_PROXY:
        return None

    conf = DYNAMIC_PROXY
    proxy_str = f"http://{conf['username']}:{conf['password']}@{conf['host']}:{conf['port']}"
    return {
        "http": proxy_str,
        "https": proxy_str,
    }
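
One thing to watch in get_proxies(): the username and password are interpolated into the proxy URL as-is, so a password containing "@", ":", or "/" would produce a URL that parses incorrectly. A minimal sketch of percent-encoding the credentials first is shown below; build_proxy_url() is a hypothetical helper, and the host and credentials are the article's placeholders.

```python
from urllib.parse import quote


def build_proxy_url(host: str, port: int, username: str, password: str,
                    scheme: str = "http") -> str:
    """Build a proxy URL with percent-encoded credentials.

    Hypothetical helper, not part of the original proxy_config.py.
    safe="" makes quote() escape every reserved character, including
    "/", ":" and "@".
    """
    user = quote(username, safe="")
    pwd = quote(password, safe="")
    return f"{scheme}://{user}:{pwd}@{host}:{port}"


if __name__ == "__main__":
    print(build_proxy_url("gw.yourproxy.com", 8000, "your_user", "p@ss:word"))
```

The returned string can be used for both the "http" and "https" keys of the proxies dictionary, in the same way as proxy_str in get_proxies().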

Next, rework the earlier fetch_html() to add a retry mechanism and the proxy settings.

import time
import random
import logging
import requests
from proxy_config import get_proxies

logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s [%(levelname)s] %(message)s"
)

MAX_RETRY = 3

HEADERS = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"
}

def fetch_html(url: str, timeout: int = 15) -> str:
    proxies = get_proxies()
    last_error = None

    for attempt in range(1, MAX_RETRY + 1):
        try:
            logging.info(f"Request url={url}, attempt={attempt}")
            resp = requests.get(
                url,
                headers=HEADERS,
                proxies=proxies,
                timeout=timeout,
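                verify=False   # disables TLS certificate verification; some proxies interfere with it, so turn it off only when necessary
            )

            # Some sites return 200 for what is really a CAPTCHA page, so also check the body for a keyword
            if resp.status_code == 200 and "captcha" not in resp.text.lower():
                return resp.text
            else:
                # Record the reason so the final error message is not "last_error=None"
                last_error = f"status={resp.status_code}, maybe blocked or redirected"
                logging.warning(
                    f"Status={resp.status_code}, maybe blocked or redirected."
                )

        except Exception as e:
            last_error = e
            logging.warning(f"Attempt {attempt} failed: {e}")

        # Pause briefly after a failure to go easy on the proxy IPs
        time.sleep(random.uniform(1, 3))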

    raise RuntimeError(f"Request failed after {MAX_RETRY} attempts, last_error={last_error}")

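At this point, your script:

"Sends every request through the dynamic proxy gateway by default, retries automatically, and applies a simple block check"

For CSDN readers, this is often the most valuable part; it can be dropped straight into your own project.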
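
The block check and the pause between retries are both inline in fetch_html(). As a sketch of how they could be pulled out and tuned separately, the two helpers below use only the standard library. looks_blocked(), backoff_delay(), and the marker list are hypothetical; real CAPTCHA pages differ per site, so the markers need to be adjusted, and a product page that happens to contain the word "captcha" would be a false positive.

```python
import random

# Hypothetical marker list (lowercase); adjust it to the pages you actually see.
BLOCK_MARKERS = ("captcha",)


def looks_blocked(status_code: int, body: str, markers=BLOCK_MARKERS) -> bool:
    """Return True when a response should be retried instead of parsed."""
    if status_code != 200:
        return True
    lowered = body.lower()
    return any(marker in lowered for marker in markers)


def backoff_delay(attempt: int, base: float = 1.0, cap: float = 30.0) -> float:
    """Exponential backoff with full jitter.

    attempt starts at 1. The delay is drawn uniformly from
    [0, min(cap, base * 2 ** (attempt - 1))], so later attempts tend to
    wait longer, which differs from the fixed 1-3 second pause above.
    """
    upper = min(cap, base * 2 ** (attempt - 1))
    return random.uniform(0, upper)
```

Inside the retry loop, the condition would become `if not looks_blocked(resp.status_code, resp.text)` and the pause would become `time.sleep(backoff_delay(attempt))`.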

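VII. Unified multi-shop price comparison: complete script

Below is a multi-shop price comparison main program that combines everything above and is ready to run once the placeholders are filled in: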

# price_monitor.py
import csv
import logging
import random
import time

from shops_config import SHOPS
from proxy_config import get_proxies
from bs4 import BeautifulSoup
import requests

logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s [%(levelname)s] %(message)s"
)

HEADERS = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"
}

MAX_RETRY = 3


def fetch_html(url: str, timeout: int = 15) -> str:
    proxies = get_proxies()
    last_error = None

    for attempt in range(1, MAX_RETRY + 1):
        try:
            logging.info(f"Request url={url}, attempt={attempt}")
            resp = requests.get(
                url,
                headers=HEADERS,
                proxies=proxies,
                timeout=timeout,
                verify=False
            )
            if resp.status_code == 200 and "captcha" not in resp.text.lower():
                return resp.text
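            else:
                # Record the reason so the final error message is not "last_error=None"
                last_error = f"status={resp.status_code}, maybe blocked or redirected"
                logging.warning(
                    f"Status={resp.status_code}, maybe blocked or redirected."
                )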
        except Exception as e:
            last_error = e
            logging.warning(f"Attempt {attempt} failed: {e}")

        time.sleep(random.uniform(1, 3))

    raise RuntimeError(f"Request failed after {MAX_RETRY} attempts, last_error={last_error}")


def parse_shop(shop_conf, html_text: str):
    soup = BeautifulSoup(html_text, "lxml")
    items = soup.select(shop_conf["item_selector"])
    results = []

    for it in items:
        title_el = it.select_one(shop_conf["title_selector"])
        price_el = it.select_one(shop_conf["price_selector"])
        link_el = it.select_one(shop_conf["link_selector"])

        if not (title_el and price_el and link_el):
            continue

        title = title_el.get_text(strip=True)
        price_raw = price_el.get_text(strip=True)
        price = shop_conf["price_cleaner"](price_raw)
        link = link_el.get("href")

        if link and link.startswith("/"):
            base = shop_conf["search_url"].split("/", 3)
            if len(base) >= 3:
                link = base[0] + "//" + base[2] + link

        results.append({
            "shop": shop_conf["name"],
            "title": title,
            "price": price,
            "url": link
        })

    return results


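def save_to_csv(rows, filename="price_monitor_result.csv"):
    with open(filename, "w", newline="", encoding="utf-8") as f:
        # rows also carry the helper field "price_value" added in main();
        # without extrasaction="ignore", DictWriter raises ValueError on it
        writer = csv.DictWriter(
            f,
            fieldnames=["shop", "title", "price", "url"],
            extrasaction="ignore",
        )
        writer.writeheader()
        for row in rows:
            writer.writerow(row)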


def main():
    keyword = input("Enter the product keyword to monitor: ").strip()
    all_results = []

    for shop in SHOPS:
        search_url = shop["search_url"].format(keyword=keyword)
        try:
            html_text = fetch_html(search_url)
            data = parse_shop(shop, html_text)
            logging.info(f"[{shop['name']}] got {len(data)} items.")
            all_results.extend(data)
        except Exception as e:
            logging.error(f"[{shop['name']}] failed: {e}")

        time.sleep(random.uniform(1, 3))  # pause briefly between shops as well

    # Sort by price (the price field is a string, so convert it to float first)
    for item in all_results:
        try:
            item["price_value"] = float(item["price"])
        except ValueError:
            item["price_value"] = 999999999  # rows that fail to parse go last

    all_results.sort(key=lambda x: x["price_value"])

    save_to_csv(all_results)
    logging.info(f"Done. total={len(all_results)} rows saved to price_monitor_result.csv")


if __name__ == "__main__":
    main()

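All you need to do is:

Update the CSS selectors in SHOPS to match the actual target sites

Replace host/port/username/password in proxy_config.py with your own dynamic proxy gateway details

Run python price_monitor.py on the command line and enter a keyword.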
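
The sort in main() stores a helper price_value field on every row. A possible alternative, sketched here under the assumption that price_cleaner has already stripped currency symbols and thousands separators, is a sort key function that leaves the rows untouched, so they contain only the four CSV columns. price_sort_key() is a hypothetical name, and Decimal is used instead of float to avoid rounding surprises with prices.

```python
from decimal import Decimal, InvalidOperation


def price_sort_key(row: dict):
    """Sort key: parsable prices first (ascending), unparsable ones last.

    Hypothetical helper, not part of the original script. It returns a
    tuple so that rows with a bad price compare after all valid rows.
    """
    try:
        value = Decimal(row["price"])
    except (InvalidOperation, TypeError, KeyError):
        return (1, Decimal(0))
    if not value.is_finite():  # guards against strings like "NaN" or "Infinity"
        return (1, Decimal(0))
    return (0, value)


if __name__ == "__main__":
    rows = [{"price": "5999.00"}, {"price": "N/A"}, {"price": "5299"}]
    for row in sorted(rows, key=price_sort_key):
        print(row)
```

In main(), this would replace the loop that sets price_value together with the lambda, as `all_results.sort(key=price_sort_key)`.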

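VIII. Further improvements

In a real production environment, you can build on this in many ways:

Task scheduling
- Run on a schedule with crontab / apscheduler, for example checking prices every 30 minutes
- Integrate with an in-house scheduling system (Airflow, Celery, etc.)
Alerts & notifications
- Send an email / WeCom / DingTalk notification automatically when a price drops below a threshold
- Raise an alert when a shop keeps failing for a long time, so you can check the IPs or the parsing logic
Multi-region price comparison
- Use the dynamic proxy to pick the exit country and compare pricing strategies across regions
- For example, add a parameter such as country=us or country=jp to the username
More refined anti-blocking measures
- Maintain cookies / sessions
- Simulate reasonable browsing behavior: visiting detail pages, paging through results, etc.
- Combine IP, User-Agent, and request-rate strategies to reduce the chance of triggering risk controls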
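
As a starting point for the price-threshold alert mentioned in the list above, here is a minimal sketch that only selects the rows worth notifying about; how the notification is sent (email, WeCom, DingTalk) is left out. find_price_alerts() is a hypothetical helper that works on the same row dictionaries that parse_shop() produces.

```python
from decimal import Decimal, InvalidOperation


def find_price_alerts(rows, threshold):
    """Return the rows whose price is strictly below threshold.

    Hypothetical helper, not part of the original script. Rows with a
    missing or unparsable price are skipped rather than treated as hits.
    """
    limit = Decimal(str(threshold))
    alerts = []
    for row in rows:
        try:
            is_hit = Decimal(row["price"]) < limit
        except (InvalidOperation, TypeError, KeyError):
            continue
        if is_hit:
            alerts.append(row)
    return alerts


if __name__ == "__main__":
    sample = [
        {"shop": "ShopA", "title": "Phone X", "price": "5299", "url": "https://shop-a.example.com/item/1"},
        {"shop": "ShopB", "title": "Phone X", "price": "5999.00", "url": "https://shop-b.example.com/item/9"},
    ]
    for hit in find_price_alerts(sample, 5500):
        print(f"[ALERT] {hit['shop']} {hit['title']} at {hit['price']}")
```

A scheduled run could call this right after save_to_csv() and pass the hits to whatever notification channel is in use.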