python3.6简单代码_Python3.6 爬虫入门自学教程之四：urllib应用最简单的爬虫代码实例...

最新推荐文章于 2025-02-13 00:15:00 发布

原创最新推荐文章于 2025-02-13 00:15:00 发布 · 487 阅读

1 ·

本内容遵循CC 4.0 BY-SA版权协议

标签

#python3.6简单代码

本文介绍了Python3.6中使用urllib库进行网页爬取的基础知识，包括GET和POST请求的实现。通过实例代码展示了如何使用urllib.request.urlopen()进行简单的爬虫操作，以及如何构造Request对象来发送POST请求。同时，解释了urllib.parse.urlencode()函数在编码数据中的作用，并讨论了超时设置和错误处理的重要性。

Python3.6 爬虫入门之四urllib应用最简单的爬虫代码实例

1.简单爬虫实例代码-get请求方式

Python

# -*- coding: utf-8 -*-

import urllib.request

url = 'http://www.baidu.com/'

def getHtml(url):

page = urllib.request.urlopen(url)

html = page.read().decode(encoding='utf-8', errors='strict')

return html

print(getHtml(url))

# -*- coding: utf-8 -*-

importurllib.request

url='http://www.baidu.com/'

defgetHtml(url):

page=urllib.request.urlopen(url)

html=page.read().decode(encoding='utf-8',errors='strict')

returnhtml

print(getHtml(url))

上面这几行短短的代码，就已经实现了最简单的爬虫功能。点击pycharm软件的运行，下面的调试框就能打印出爬取下来的百度首页的HTML等代码了。(这里我使用的是pycharm软件进行编程调试的)

urlopen打开url链接之后，就获得了数据，这个数据需要进行解码，解析成HTML数据，所以使用了decode这个方法。

首先import urllib.request这个第三方库，这里需要注意一下，这个和python2.7里不一样。python 3.x中urllib库和urilib2库合并成了urllib库。

其中urllib2.urlopen()变成了urllib.request.urlopen()

urllib2.Request()变成了urllib.request.Request()

这几个库就是我们常用的url库，这里可以看到urllib.request是用于打开url链接的扩展库，urllib.response是常用的url应答类。urllib.parse是用于解析数据的模块。然后点击都能看到每个模块都能点击打开。都有详细的英文说明：https://docs.python.org/3.6/library/urllib.request.html

urllib.request.urlopen(url, data=None, [timeout, ]*, cafile=None, capath=None, cadefault=False, context=None)

Open the URL url, which can be either a string or a Request object.

这个方法，实际上可以传入的参数有很多。我们常用的是三个参数的：

urllib.requeset.urlopen(url,data,timeout)，第一个参数url代表的是链接地址，第二个data是经过urllib.parse解析后转换的数据，然后这个数据作为data。这样就完成了一次post请求，而最上面的那个简单爬虫代码urllib.request.Request()只有1个参数实际上就是默认的get请求，而urlopen()中传入data，完成的就是post请求。下面这里的http://httpbin.org/post这个网站可以用来测试urllib各个功能的模拟站点使用。

2.post请求方式

Python

import urllib.parse

import urllib.request

data = bytes(urllib.parse.urlencode({'word': 'hello'}), encoding='utf8')

print(data)

response = urllib.request.urlopen('http://httpbin.org/post', data=data)

print(response.read())

importurllib.parse

importurllib.request

data=bytes(urllib.parse.urlencode({'word':'hello'}),encoding='utf8')

print(data)

response=urllib.request.urlopen('http://httpbin.org/post',data=data)

print(response.read())

打印如下信息：

Python

D:\python\python.exe E:/技术学习/Python代码/3.简单post/.idea/简单post.py

b'word=hello'

b'{"args":{},"data":"","files":{},"form":{"word":"hello"},"headers":{"Accept-Encoding":"identity","Connection":"close","Content-Length":"10","Content-Type":"application/x-www-form-urlencoded","Host":"httpbin.org","User-Agent":"Python-urllib/3.6"},"json":null,"origin":"117.136.8.65","url":"http://httpbin.org/post"}\n'

D:\python\python.exeE:/技术学习/Python代码/3.简单post/.idea/简单post.py

b'word=hello'

urlencode是一个函数，可将字符串以URL编码，用于编码处理。URL编码(URL encoding)，也称作百分号编码(Percent-encoding)，是特定上下文的统一资源定位符 (URL)的编码机制。——百度百科

urllib.parse.urlencode({‘word’: ‘hello’}), encoding=’utf8′)这个函数的详细说明，也可以在https://docs.python.org/3.6/library/urllib.parse.html这里有关于这个函数的讲解。不过是英文的

urllib.requeset.urlopen(url,data,timeout)，第三个参数，一般设置个合理值或者不设置。如果超时设置的时间太短，就会报错，报错之后我们就要进行相应的错误处理，在下一篇博文中我将进行表述。这里设置timeout = 1，正常返回结果。

Python

import urllib.parse

import urllib.request

data = bytes(urllib.parse.urlencode({'word': 'hello'}), encoding='utf8')

print(data)

response = urllib.request.urlopen('http://httpbin.org/post', data=data,timeout=1)

print(response.read())

importurllib.parse

importurllib.request

data=bytes(urllib.parse.urlencode({'word':'hello'}),encoding='utf8')

print(data)

response=urllib.request.urlopen('http://httpbin.org/post',data=data,timeout=1)

print(response.read())

如果这里将timeout = 1改写为 timeout = 0.1，就会报如下错误：

Python

File "D:\python\lib\urllib\request.py", line 1346, in http_open

return self.do_open(http.client.HTTPConnection, req)

File "D:\python\lib\urllib\request.py", line 1320, in do_open

raise URLError(err)

urllib.error.URLError:

File"D:\python\lib\urllib\request.py",line1346,inhttp_open

returnself.do_open(http.client.HTTPConnection,req)

File"D:\python\lib\urllib\request.py",line1320,indo_open

raiseURLError(err)

urllib.error.URLError:

3.urllib.request.Request()

classurllib.request.Request(url, data=None, headers={}, origin_req_host=None, unverifiable=False, method=None)

这个请求方式，比直接使用上面的urlopen更加规范，一般在进行网络请求的时候，还需要包含很多额外的信息，比如header请求头，通过构建一个request，服务器响应请求得到应答，这样显得逻辑上清晰明确。使用Request这个类的时候，还需要添加其他的一些基本。

Python

import urllib.request

request = urllib.request.Request("http://www.baidu.com")

response = urllib.request.urlopen(request)

print(response.read())

importurllib.request

request=urllib.request.Request("http://www.baidu.com")

response=urllib.request.urlopen(request)

print(response.read())

上面这段代码，和最上面的代码是效果是一样的，这里推荐使用这种方式。首先构建Request对象，然后使用urlopen打开数据。

下一篇博文，我们将使用urllib的高级方法，来进行设置请求头，通过爬虫模拟登陆csdn网站。

赞赏作者

微信赞赏支付宝赞赏

喜欢 (2)or分享 (0)