gzl's Blog

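Python Requests: Getting Started and Hands-on Practice

Posted on 2018-09-26, updated on 2019-11-25, filed under python

Study notes for the course "Python Web Crawling and Information Extraction", offered by Beijing Institute of Technology on China University MOOC.

Basic code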

import requests


def getHTMLText(url):
    """Fetch url and return the decoded body, or an error message on failure."""
    try:
        r = requests.get(url, timeout=30)
        # https://2.python-requests.org/en/master/user/quickstart/#response-status-codes
        r.raise_for_status()
        # https://2.python-requests.org/en/master/api/#requests.Response.apparent_encoding
        r.encoding = r.apparent_encoding
        # https://2.python-requests.org/en/master/api/#requests.Response.text
        return r.text
    except requests.RequestException:
        # Covers connection errors, timeouts, and the HTTPError from raise_for_status()
        return "An exception occurred"


if __name__ == "__main__":
    url = "http://www.baidu.com"
    print(getHTMLText(url))

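Further reading: a simple way to understand if __name__ == "__main__": in Python.

Here r.raise_for_status() inspects r.status_code internally and raises requests.HTTPError when the response is an error (a 4xx or 5xx status), so no extra if statement is needed and failures can be handled in the try-except block. r.apparent_encoding is the encoding inferred from the response content itself (the fallback encoding), while r.encoding is the encoding guessed from the HTTP headers.

In the example above, r.apparent_encoding is utf-8 and r.encoding is ISO-8859-1. Without the line r.encoding = r.apparent_encoding, the Chinese text of the Baidu home page comes back garbled.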

https://2.python-requests.org/en/master/api/#requests.Response.text

r.text :Content of the response, in unicode.

If Response.encoding is None, encoding will be guessed using chardet.

The encoding of the response content is determined based solely on HTTP headers, following RFC 2616 to the letter. If you can take advantage of non-HTTP knowledge to make a better guess at the encoding, you should set r.encoding appropriately before accessing this property.
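
The mismatch described above can be reproduced without any network request. The following is a minimal sketch using only the standard library: the sample string is an assumption standing in for the page body, and plain bytes.decode calls stand in for what r.text does with whatever r.encoding is set to.

```python
# Sample text standing in for a UTF-8 encoded page body (an assumption for illustration).
original = "百度一下，你就知道"
body = original.encode("utf-8")

# Decoding with the header-based default (ISO-8859-1) never fails,
# but every byte becomes its own character, so the text is garbled.
garbled = body.decode("ISO-8859-1")

# Decoding with the encoding detected from the content restores the text.
restored = body.decode("utf-8")

print(len(original), len(garbled))
print(restored == original)
```

Because ISO-8859-1 maps every byte to a character, the wrong decoding raises no error, which is why the problem only shows up as unreadable output.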

Robots protocol

Purpose: a website tells web crawlers which pages may be crawled and which may not.

Format: JD.com's robots.txt

User-agent: * 
Disallow: /?*
Disallow: /pop/*.html
Disallow: /pinpai/*.html?*
User-agent: EtaoSpider
Disallow: /
User-agent: HuihuiSpider
Disallow: /
User-agent: GwdangSpider
Disallow: /
User-agent: WochachaSpider
Disallow: /

Basic syntax:

# Comment: * matches all user agents, / is the root directory
User-agent: *
Disallow: /
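
Rules like these can be checked programmatically with the standard library's urllib.robotparser. The sketch below is only illustrative: the rule text is a simplified, wildcard-free excerpt modeled on the listing above, and the agent name MyCrawler and the example paths are hypothetical.

```python
from urllib.robotparser import RobotFileParser

# Simplified, wildcard-free rules modeled on the listing above (an assumption
# for illustration; "/pop/" replaces the original "/pop/*.html" pattern).
ROBOTS_TXT = """\
User-agent: *
Disallow: /pop/
User-agent: EtaoSpider
Disallow: /
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def allowed(agent, url):
    """Return True if the parsed rules let `agent` fetch `url`."""
    return parser.can_fetch(agent, url)


# EtaoSpider is blocked from the whole site.
print(allowed("EtaoSpider", "https://www.jd.com/"))
# Any other crawler falls under "User-agent: *".
print(allowed("MyCrawler", "https://www.jd.com/pop/a.html"))
print(allowed("MyCrawler", "https://www.jd.com/item/1.html"))
```

For a live site, RobotFileParser also offers set_url() and read() to download the file instead of parsing a string.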

Requests in practice

Submitting search keywords to Baidu and 360

import requests

keyword = "Python"
try:
    # params is URL-encoded and appended to the URL as the query string
    kv = {'wd': keyword}
    r = requests.get("http://www.baidu.com/s", params=kv)
    print(r.request.url)  # http://www.baidu.com/s?wd=Python
    r.raise_for_status()
    print(len(r.text))
except requests.RequestException:
    print("Crawl failed")

Baidu keyword interface: https://www.baidu.com/s?wd=keyword
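
To see what the keyword interface looks like once a keyword is filled in, the query string can be built by hand. This is a rough sketch with the standard library's urllib.parse.urlencode, approximating what the params argument does; the helper name build_search_url and the second keyword are assumptions for illustration.

```python
from urllib.parse import urlencode


def build_search_url(base, params):
    """Append URL-encoded query parameters to base, roughly what params= does."""
    return base + "?" + urlencode(params)


# ASCII keywords pass through unchanged.
print(build_search_url("http://www.baidu.com/s", {"wd": "Python"}))
# Non-ASCII keywords are percent-encoded as UTF-8 bytes.
print(build_search_url("http://www.baidu.com/s", {"wd": "爬虫"}))
```

Letting the library do the encoding avoids hand-escaping keywords that contain spaces or Chinese characters.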

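Downloading and saving an image from the web

I'm still not comfortable with file operations…

import os

import requests

url = "http://image.ngchina.com.cn/2018/0926/20180926031035591.jpg"
root = "D://pics//"
# Use the last segment of the URL as the local file name
path = root + url.split('/')[-1]
try:
    if not os.path.exists(root):
        os.mkdir(root)
    if not os.path.exists(path):
        r = requests.get(url)
        # Do not save an error page as if it were the image
        r.raise_for_status()
        # r.content holds the raw bytes, so the file is opened in binary mode
        with open(path, 'wb') as f:
            f.write(r.content)
        # The with block closes the file; no explicit f.close() is needed
        print("File saved")
    else:
        print("File already exists")
except (requests.RequestException, OSError):
    print("Crawl failed")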
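
Since the file handling is the unfamiliar part, it can be practiced on its own, without a download. The sketch below isolates the save step; the helper name save_bytes, the example URL, and the byte payload are all hypothetical, and a temporary directory stands in for D://pics//.

```python
import os
import tempfile


def save_bytes(root, url, content):
    """Save content under root, named after the last URL segment.

    Returns "saved" or "exists", mirroring the two messages in the script above.
    """
    os.makedirs(root, exist_ok=True)  # also creates missing parent directories
    path = os.path.join(root, url.split("/")[-1])
    if os.path.exists(path):
        return "exists"
    with open(path, "wb") as f:  # the with block closes the file on exit
        f.write(content)
    return "saved"


# Hypothetical URL and payload; a temporary directory stands in for D://pics//.
with tempfile.TemporaryDirectory() as tmp:
    pics = os.path.join(tmp, "pics")
    print(save_bytes(pics, "http://example.com/a/b.jpg", b"\xff\xd8\xff"))
    print(save_bytes(pics, "http://example.com/a/b.jpg", b"\xff\xd8\xff"))
```

os.path.join is used here instead of string concatenation so the separator is right on any platform.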

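DOM structure

Posted on 2018-08-25, updated on 2019-11-27

I. Basic DOM operations

  1. getElementById is defined on Document.prototype, so it cannot be called on Element nodes.

  2. getElementsByName is defined on HTMLDocument.prototype, so it is not available outside an
    HTML document (XML documents, Elements).

  3. getElementsByTagName is defined on both Document.prototype and Element.prototype.

  4. HTMLDocument.prototype defines some commonly used properties; body and head refer to the
    <body> and <head> elements of the HTML document.

  5. Document.prototype defines the documentElement property, which refers to the root element
    of the document; in an HTML document it is always the <html> element.

  6. getElementsByClassName, querySelectorAll, and querySelector are defined on both
    Document.prototype and Element.prototype.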

II. DOM structure tree

Node
  1. Document        ->  HTMLDocument
  2. CharacterData   ->  1) Text
                         2) Comment
  3. Element         ->  HTMLElement
  4. Attr
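
The same inheritance shape can be explored outside the browser. The sketch below uses Python's xml.dom.minidom, which is a different DOM implementation from the browser's (it has no HTMLDocument or HTMLElement), so it only illustrates the Node / CharacterData / Element relationships; the sample markup is hypothetical.

```python
from xml.dom import minidom

# Hypothetical document used only for illustration.
doc = minidom.parseString("<html><head/><body><p>hi</p><!--note--></body></html>")

# documentElement is the root element of the document.
root = doc.documentElement
print(root.tagName)

# getElementsByTagName is available on both the document and an element.
body = doc.getElementsByTagName("body")[0]
paragraph = body.getElementsByTagName("p")[0]

# Text and Comment both derive from CharacterData, which derives from Node.
print(isinstance(paragraph.firstChild, minidom.CharacterData))
print(isinstance(body.lastChild, minidom.Comment))
print(issubclass(minidom.Element, minidom.Node), issubclass(minidom.Attr, minidom.Node))
```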

JS compatibility helpers

Posted on 2018-08-08, updated on 2019-09-07, filed under JavaScript

I. A type() helper

1. Split values into two kinds: primitive values and reference values

2. Distinguish among the reference values

function type(target) {
    var ret = typeof (target);
    var template = {
        "[object Array]": "array",
        "[object Object]": "object",
        "[object Number]": "number - object",
        "[object Boolean]": "boolean - object",
        "[object String]": "string - object"
    };

    if (target === null) {
        return "null";
    } else if (ret == "object") {
        // Arrays, plain objects, and wrapper objects
        var str = Object.prototype.toString.call(target);
        // Fall back to "object" for types missing from the table (e.g. Date)
        return template[str] || "object";
    } else {
        return ret;
    }
}

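II. Compatibility helper: scroll offset, getScrollOffset()

function getScrollOffset() {
    // pageXOffset can legitimately be 0, so test that it exists rather than its truthiness
    if (window.pageXOffset !== undefined) {
        return {
            x: window.pageXOffset,
            y: window.pageYOffset
        };
    } else {
        // Old IE: only one of body / documentElement carries the value and the other is 0,
        // so adding them works in either mode
        return {
            x: document.body.scrollLeft + document.documentElement.scrollLeft,
            y: document.body.scrollTop + document.documentElement.scrollTop
        };
    }
}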

III. Compatibility helper: viewport size, getViewportOffset()

Quirks mode vs. standards mode

function getViewportOffset() {
    if (window.innerWidth) {
        return {
            w: window.innerWidth,
            h: window.innerHeight
        };
    } else {
        if (document.compatMode === "BackCompat") {
            // Quirks mode: the viewport size is reported on body
            return {
                w: document.body.clientWidth,
                h: document.body.clientHeight
            };
        } else {
            // Standards mode: the viewport size is reported on documentElement
            return {
                w: document.documentElement.clientWidth,
                h: document.documentElement.clientHeight
            };
        }
    }
}

IV. Compatibility helper: reading an element's style

function getStyle(elem, prop) {
    if (window.getComputedStyle) {
        return window.getComputedStyle(elem, null)[prop];
    } else {
        // Old IE
        return elem.currentStyle[prop];
    }
}

V. Cross-browser event binding: addEvent

function addEvent(elem, type, handle) {
    if (elem.addEventListener) {
        elem.addEventListener(type, handle, false);
    } else if (elem.attachEvent) {
        // attachEvent calls the handler with this === window, so rebind it to elem
        elem.attachEvent('on' + type, function () {
            handle.call(elem, window.event);
        });
    } else {
        elem['on' + type] = handle;
    }
}

VI. Stopping event bubbling

function stopBubble(event) {
    if (event.stopPropagation) {
        event.stopPropagation();
    } else {
        // Old IE
        event.cancelBubble = true;
    }
}

VII. Preventing the default action

document.oncontextmenu = function (e) {
    console.log('a');
    // Old IE passes no argument; the event lives on window.event
    cancelHandler(e || window.event);
};

function cancelHandler(event) {
    if (event.preventDefault) {
        event.preventDefault();
    } else {
        // Old IE
        event.returnValue = false;
    }
}

VIII. Loading a script

function loadScript(url, callback) {
    var script = document.createElement('script');
    script.type = "text/javascript";

    if (script.readyState) {
        script.onreadystatechange = function () { // IE
            if (script.readyState == "loaded" || script.readyState == "complete") {
                // Clear the handler so the callback cannot fire twice
                script.onreadystatechange = null;
                callback();
            }
        };
    } else {
        script.onload = function () {
            callback();
        };
    }
    // Set src after the handlers are attached so a cached script cannot load first
    script.src = url;
    document.head.appendChild(script);
}


loadScript('demo.js', function () {
    test();
});

IX. Dragging

<!DOCTYPE html>
<html lang="en">

<head>
    <meta charset="UTF-8">
    <title>sc</title>
    <style type="text/css">
        div {
            width: 100px;
            height: 100px;
            background-color: red;
            position: absolute;
            left: 0;
            top: 0;
        }
    </style>
</head>

<body>
    <!-- left and top are also set inline because drag() reads elem.style.left / elem.style.top -->
    <div style="width: 100px;height: 100px;background: red;position: absolute;top: 0;left: 0;"></div>
    <script type="text/javascript">
        var div = document.getElementsByTagName('div')[0];

        drag(div);


        function cancelHandler(event) {
            if (event.preventDefault) {
                event.preventDefault();
            } else {
                event.returnValue = false;
            }
        } // Prevent the default action

        function stopBubble(event) {
            if (event.stopPropagation) {
                event.stopPropagation();
            } else {
                event.cancelBubble = true;
            }
        } // Stop event bubbling

        function addEvent(elem, type, handle) {
            if (elem.addEventListener) {
                elem.addEventListener(type, handle, false);
            } else if (elem.attachEvent) {
                elem.attachEvent('on' + type, function () {
                    handle.call(elem);
                })
            } else {
                elem['on' + type] = handle;
            }
        }

        function removeEvent(elem, type, handle) {
            if (elem.removeEventListener) {
                elem.removeEventListener(type, handle, false);
            } else if (elem.detachEvent) {
                elem.detachEvent('on' + type, function () {
                    handle.call(elem);
                })
            } else {
                elem['on' + type] = null;
            }
        } // Unreliable on old IE (detachEvent receives a new function, not the one attached); to be fixed

        function drag(elem) {
            var disX, disY;
            addEvent(elem, 'mousedown', function (e) {
                var event = e || window.event;
                // Offset between the pointer and the element's top-left corner
                disX = event.clientX - parseInt(elem.style.left, 10);
                disY = event.clientY - parseInt(elem.style.top, 10);
                addEvent(document, 'mousemove', mouseMove);
                addEvent(document, 'mouseup', mouseUp);
                stopBubble(event);
                cancelHandler(event);

            })

            function mouseMove(e) {
                var event = e || window.event;
                elem.style.left = event.clientX - disX + "px";
                elem.style.top = event.clientY - disY + "px";

            }

            function mouseUp(e) {
                var event = e || window.event;
                removeEvent(document, 'mousemove', mouseMove);
                removeEvent(document, 'mouseup', mouseUp);
            }
        }
    </script>
</body>

</html>
© 2020 gzl