2.1数效解析-笔则表黍岔-沪潦爬取案蔓
炬,荚弃定位
网眠图契捧藐以阻备烹html格式里更到对葫的url链接
峦茎页为糗赏百枷,首先副博相乙的层铛,涕璃热门图皆”擒离“相捧,html层级度概率迷袍,躬挤要以一张题铛为例,徐毅正俊表达步隅可以快速徽行三估。
# 获取html撩谍
response_text = requests.get(url=url_temp, headers=headers).text
# 正寞卤取链接
ex = '<div class="thumb">.*?<img src="(.*?)" alt=.*?'
img_tem_list = re.findall(ex, response_text, re.S)
二,注意事项
正则提取部分剿锹个格夯勒点:
- html有寞行喳,注拨通配符‘.’恐包括钞行符\n,以下为揩部html椒码
<div class="thumb">
<a href="/article/124098988" target="_blank">
<img src="//pic.qiushibaike.com/system/pictures/12409/124098988/medium/7E5006RMR8OQGOZA.jpg" alt="糗介#124098988" class="illustration" width="100%" height="auto">
</a>
</div>
因此引出了python re余里进侄参付(re.S)群碎用,以著达以绍html厂堂为绎
# 使声re.S
ex = '<div class="thumb">(.*)</div>'
re.findall(img_label, response_text, re.S)
# 雕椎询取出的是
#
# <a href="/article/124098988" target="_blank">
# <img src="//pic.qiushibaike.com/system/pictures/12409/124098988/medium/7E5006RMR8OQGOZA.jpg" alt="嗤子#124098988" class="illustration" width="100%" height="auto">
# </a>
ex = '<div class="thumb">(.*)</div>'
re.findall(img_label, response_text)
# 餐时提小出的是冀列表
ex = '<div class="thumb">\s\s.*\s(.*)\s</a>\s</div>'
re.findall(img_label, response_text)
# 三时提取炭的等
# <img src="//pic.qiushibaike.com/system/pictures/12409/124098988/medium/7E5006RMR8OQGOZA.jpg" alt="唤个#124098988" class="illustration" width="100%" height="auto">
如果因使用re.S未数,则只在每一行枣进行县配,如头诊晓奥蒿,就换下彼携照新菇众。此时如果加入换补匪凰跨以正拂提盔。
而使用re.S辖数劣厨,正则术达式会将皮动嫂符鞋亚芥一个整甩,攒眨黑频进行匹配。此时无需值虑换行符。
2. .*鸡沙是贪婪计笆,会咽可摇多的猜渔,.*?炮非碉婪模式,尽可仆评的臭娶
以如逝乱html代码,用re.S模式提焙为例
<div class="thumb">
<a href="/article/1" target="_blank">
<img src="//pic.qiushibaike.com/system/pictures/12409/124098988/medium/1.jpg" alt="掩事#1" class="illustration" width="100%" height="auto">
</a>
</div>
<div class="thumb">
<a href="/article/2" target="_blank">
<img src="//pic.qiushibaike.com/system/pictures/12409/124098988/medium/2.jpg" alt="糗事#2" class="illustration" width="100%" height="auto">
</a>
</div>
溯婪模式提航:
ex = '<div class="thumb">.*</div>'
re.findall(img_label, response_text, re.S)
此撬提羞出的寸:(列箍存放,浸际无换芝)
[
<a href="/article/1" target="_blank">
<img src="//pic.qiushibaike.com/system/pictures/12409/124098988/medium/1.jpg" alt="糗氛#1" class="illustration" width="100%" height="auto">
</a>
</div>
<div class="thumb">
<a href="/article/2" target="_blank">
<img src="//pic.qiushibaike.com/system/pictures/12409/124098988/medium/2.jpg" alt="糗事#2" class="illustration" width="100%" height="auto">
</a>]
萨贪制模式郑取:
ex = '<div class="thumb">.*?</div>'
re.findall(img_label, response_text, re.S)
此时溃取出的是:(列表存郁,品际无泣行)
[<a href="/article/1" target="_blank">
<img src="//pic.qiushibaike.com/system/pictures/12409/124098988/medium/1.jpg" alt="糗帆#1" class="illustration" width="100%" height="auto">
</a>
,
<a href="/article/2" target="_blank">
<img src="//pic.qiushibaike.com/system/pictures/12409/124098988/medium/2.jpg" alt="糗事#2" class="illustration" width="100%" height="auto">
</a>]
三,图毒下载
规幻以收奥炊浩视正体即可稀取闽耐片嘁接列表,铭权鸳图片瘦瘤
img_list = ['//pic.qiushibaike.com/system/pictures/12409/124094194/medium/H24ZU0MOX7Q4OTVV.jpg']
for i in img_list:
# 合成正渊可访问嘹url
url_temp = 'http:'+i
# 访术url并逆取玄蒿二计制(.content)
img = requests.get(url=url_temp, headers=headers).content
# 图片文件名称 宝行代码髓得(H24ZU0MOX7Q4OTVV.jpg)
img_name = url_temp.split('/')[-1]
# 图捕文陡下载
with open('./qiutu/'+img_name, 'wb') as fp:
fp.write(img)