农机网 (nongjx.com)_Multi-template pages, scraped with endless if checks (source code)_一蓑烟雨任平生

Latest recommended article published on 2021-05-19 11:46:39

This content is published under the CC 4.0 BY-SA license.

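This article presents a crawler for 农机网 (nongjx.com) that copes with the site's varying page structure to scrape each news item's title, publication time, and body text, and stores the results in a MySQL database. The example shows how to fetch pages with Python's requests library, parse them with BeautifulSoup, and handle the problems caused by different page layouts.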

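This site is one of the trickier ones, because the tag layout changes from page to page: sometimes the title sits inside a div, sometimes inside a section, sometimes inside a span. I ended up stacking if checks on top of each other until I no longer knew what I was writing.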
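
One way to keep this kind of per-layout branching readable is to list the known layouts in a table and try them in order. The snippet below is only a minimal sketch of that idea, not the script used in this post: `TEMPLATES` and `extract_title` are names I made up for illustration, the three layouts are just a sample of the class names the script checks, and the HTML string is invented test input.

```python
from bs4 import BeautifulSoup

# Each entry: (container tag, container class, tag holding the title).
# Assumption: only a sample of layouts; the real site may use others.
TEMPLATES = [
    ("section", "newsDetail", "h3"),
    ("div", "newsDetail", "h3"),
    ("div", "main_news_details", "h1"),
]


def extract_title(soup):
    """Return the title from the first layout that matches, or None."""
    for container_tag, container_class, title_tag in TEMPLATES:
        container = soup.find(container_tag, class_=container_class)
        if container is None:
            continue  # this layout is not used on the page, try the next one
        title = container.find(title_tag)
        if title is not None:
            return title.get_text(strip=True)
    return None  # no known layout matched


# Invented sample page that uses the third layout
sample = '<div class="main_news_details"><h1> Spring ploughing tips </h1></div>'
print(extract_title(BeautifulSoup(sample, "html.parser")))
```

Supporting a new layout then means adding one row to the table rather than another level of nesting, and a `None` result makes it obvious when a page uses a layout nobody has seen yet.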

Without further ado, here is the code.

Today's unlucky website is 农机网 (nongjx.com).

# -*- coding: utf-8 -*-
import requests
import pymysql
from bs4 import BeautifulSoup  # parses the downloaded HTML
import uuid
import time

url = "https://www.nongjx.com"

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/52.0.2743.116 '
                  'Safari/537.36',
    'Accept-Language': 'zh-CN,zh;q=0.8'
}
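# Connect to the local MySQL database (password/database are the current PyMySQL argument names)
conn = pymysql.connect(host='127.0.0.1', user='root', password='123456', database='zhang', charset='utf8')
cur = conn.cursor()
print("Connected to MySQL")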

for i in range(1, 10):  # crawl list pages 1 through 9
    resp = requests.get(f"https://www.nongjx.com/tech_news/t118/list_p{i}.html", headers=headers, timeout=10)
    page_one = BeautifulSoup(resp.content, "html.parser")
    dd = page_one.find('div', class_='mainLeftList').find_all('dt')
    for ss in dd:
        sUrl = url + ss.find('a')['href']
        # Open the second-level (article) page and scrape it
        rp = requests.get(sUrl, headers=headers, timeout=10)
        page_two = BeautifulSoup(rp.content, "html.parser")
        paper_id = str(uuid.uuid1())
        # Article pages use several templates, so look up every known container first
        section_detail = page_two.find('section', class_='newsDetail')
        div_detail = page_two.find('div', class_='newsDetail')
        nr_box = page_two.find('div', class_='nr_box')
        main_details = page_two.find('div', class_='main_news_details')
        news_show = page_two.find('div', class_='newsShow')
        # Same priority order as the original nested checks
        if section_detail is not None:
            title = section_detail.find('h3').text
            # Time: third field after splitting on the full-width colon
            timet = section_detail.find('span').text.strip().split("：")[2]
            content = page_two.find('div', class_='newsContent').text.strip()
        elif div_detail is not None:
            title = div_detail.find('h3').text
            timet = div_detail.find('div').text.strip().split("：")[2]
            content = page_two.find('div', class_='newsContent').text.strip()
        elif nr_box is not None:
            title = nr_box.find('h3').text
            timet = nr_box.find('p').text.strip().split("：")[2]
            content = page_two.find('div', class_='down_xx').text.strip()
        elif main_details is not None:
            title = main_details.find('h1').text
            # Time: second field after splitting on the full-width colon
            timet = main_details.find('span').text.strip().split("：")[1]
            content = page_two.find('div', class_='news_detail_content').text.strip()
        elif news_show is not None:
            title = news_show.find('a').text
            # Time: first 11 characters of the dt text
            timet = news_show.find('dt').text.strip()[0:11]
            content = page_two.find('div', class_='newsContent').text.strip()
        else:
            # None of the known templates matched; skip instead of crashing
            print("Unknown page template, skipping:", sUrl)
            continue
        print(timet)
        sql = "insert into knowledge(id,title,timet,content,p_type,url) VALUES (%s,%s,%s,%s,%s,%s)"
        cur.execute(sql, (paper_id, title, timet, content, "机械农业", sUrl))
    print("Page {} has been inserted".format(i))
    conn.commit()
    time.sleep(1)  # pause for one second so the server is not overloaded
cur.close()
conn.close()
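
The script above gets the publication time by splitting on the full-width colon and picking a fixed field (or slicing the first 11 characters), which breaks as soon as a page words that line differently. A regular expression that looks for the date itself is a bit more forgiving. This is a rough sketch under assumptions: `DATE_RE` and `extract_date` are hypothetical names, I am guessing that dates appear as 2021-05-19, 2021/5/19 or 2021年5月19日, and the sample string is made up.

```python
import re
from datetime import date

# Assumed formats: 2021-05-19, 2021/5/19 or 2021年5月19日
DATE_RE = re.compile(r"(\d{4})[-/年](\d{1,2})[-/月](\d{1,2})")


def extract_date(text):
    """Return the first date found in text as YYYY-MM-DD, or None."""
    match = DATE_RE.search(text)
    if match is None:
        return None
    year, month, day = (int(part) for part in match.groups())
    try:
        return date(year, month, day).isoformat()
    except ValueError:
        # Digits matched the pattern but are not a real calendar date
        return None


# Made-up sample of a "source / time" line
print(extract_date("来源：农机网 时间：2021-05-19 11:46"))
```

Returning `None` instead of raising lets the caller decide whether to skip the article or store it without a date.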
