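Python Web Scraping
What is a web scraper
A scraper sends a GET request to a website to retrieve the page's HTML source, then parses that HTML to extract the information you want.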
The steps
1. Fetch the HTML for a given URL
2. Parse the HTML and extract the target information (see an HTML tutorial)
3. Process and store the data
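The three steps can be sketched with nothing but the standard library. This is a minimal sketch, not the approach used in this post: it swaps in urllib and html.parser so that it runs without third-party packages. The names fetch_html, TitleParser, parse_titles, store_titles and the SAMPLE_HTML string are all made up for illustration, and the sample page stands in for a real download so the sketch runs offline.

```python
import csv
import io
from html.parser import HTMLParser
from urllib.request import Request, urlopen


def fetch_html(url: str) -> str:
    """Step 1: download the page at `url` and return its HTML as text."""
    request = Request(url, headers={"User-Agent": "Mozilla/5.0"})
    with urlopen(request, timeout=10) as response:
        return response.read().decode("utf-8", errors="replace")


class TitleParser(HTMLParser):
    """Step 2: collect the text inside every <span class="title"> element."""

    def __init__(self):
        super().__init__()
        self.titles = []
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs
        if tag == "span" and ("class", "title") in attrs:
            self._in_title = True
            self.titles.append("")

    def handle_data(self, data):
        if self._in_title:
            self.titles[-1] += data

    def handle_endtag(self, tag):
        if tag == "span":
            self._in_title = False


def parse_titles(html: str) -> list[str]:
    parser = TitleParser()
    parser.feed(html)
    parser.close()
    return [title.strip() for title in parser.titles]


def store_titles(titles, out_file) -> None:
    """Step 3: write a header row, then one title per row, as CSV."""
    writer = csv.writer(out_file)
    writer.writerow(["title"])
    writer.writerows([title] for title in titles)


# Hypothetical sample page, used so the sketch runs without network access.
SAMPLE_HTML = (
    '<ol><li><span class="title">Movie A</span></li>'
    '<li><span class="title">Movie B</span></li></ol>'
)

titles = parse_titles(SAMPLE_HTML)
buffer = io.StringIO()  # stands in for a file opened with open(..., "w", newline="")
store_titles(titles, buffer)
print(buffer.getvalue())
```

To run it against a real page, pass the result of fetch_html(url) to parse_titles instead of SAMPLE_HTML.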
Example
# Import the libraries
import requests
from bs4 import BeautifulSoup

# Pose as a browser by sending a browser-like User-Agent header
head = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.122 Safari/537.36"
}
# First page only:
# response = requests.get("https://movie.douban.com/top250?start=0", headers=head)
# Loop over every page. The step is 25 because each page lists 25 movies,
# so range(0, 250, 25) covers the first ten pages.
for start_num in range(0, 250, 25):
    response = requests.get(f"https://movie.douban.com/top250?start={start_num}", headers=head)
    html = response.text  # keep the returned HTML
    soup = BeautifulSoup(html, "html.parser")
    all_title = soup.find_all("span", attrs={"class": "title"})
    for title in all_title:
        # print(title)  # print the tag to inspect it
        # print(title.string)
        title_string = title.string  # None if the tag holds more than one child
        # Keep only the title text that does not contain "/"
        if title_string and "/" not in title_string:
            print(title_string)  # titles from the current page
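The paging arithmetic and the "/" filter from the example can be checked on their own, without sending any requests. The following is a small sketch under a few assumptions: page_urls and keep_primary_titles are hypothetical helper names, the constants simply restate the numbers used in the loop, and the sample strings are invented rather than scraped.

```python
BASE_URL = "https://movie.douban.com/top250"
PAGE_SIZE = 25      # movies per page, the step used in the loop
TOTAL_MOVIES = 250  # upper bound used in the loop


def page_urls():
    """Build one URL per page by stepping the start offset by PAGE_SIZE."""
    return [
        f"{BASE_URL}?start={start}"
        for start in range(0, TOTAL_MOVIES, PAGE_SIZE)
    ]


def keep_primary_titles(texts):
    """Keep strings that do not contain "/"; empty or None entries are skipped too."""
    return [text for text in texts if text and "/" not in text]


urls = page_urls()
print(len(urls))  # number of pages
print(urls[0])    # first page
print(urls[-1])   # last page

# Hypothetical span texts, not real scraped data
sample = ["Movie A", "\xa0/\xa0Movie A (alternate title)", None, "Movie B"]
print(keep_primary_titles(sample))
```

Keeping these two pieces as plain functions makes it easy to confirm that the offsets stop at 225 and that only the slash-free strings survive before pointing the scraper at the live site.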
Unless otherwise stated, all articles on this blog are licensed under CC BY-NC-SA 4.0. Please credit xunyue when reposting!