What is a web crawler

A crawler sends a GET request to a website to retrieve the page's HTML source, then parses that HTML to extract the information you want.
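
As a rough illustration of what "sending a GET request" involves, the sketch below builds a request object with Python's standard-library urllib and inspects it without touching the network. The URL and User-Agent value are placeholders chosen for this sketch, not anything a particular site requires.

```python
from urllib.request import Request

# Hypothetical URL and User-Agent, used only for illustration.
url = "https://example.com/"
headers = {"User-Agent": "Mozilla/5.0 (compatible; demo-crawler)"}

# With no request body supplied, urllib treats the request as a GET.
request = Request(url, headers=headers)

print(request.get_method())  # GET
print(request.full_url)      # https://example.com/

# To actually fetch the page (requires network access):
#   from urllib.request import urlopen
#   with urlopen(request, timeout=10) as response:
#       html = response.read().decode("utf-8")
```

The requests library performs the same step in a single call, requests.get(url, headers=headers).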

The steps

1. Retrieve the HTML for a given URL
2. Parse the HTML and extract the target information (see an HTML tutorial)
3. Process and store the data
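
Step 2 can be tried offline before writing any network code. The sketch below parses a small hand-written HTML fragment with the standard-library html.parser module. The fragment, the TitleParser class, and the "title" class name are assumptions made for this sketch; a real page will have its own structure.

```python
from html.parser import HTMLParser

# Hypothetical HTML fragment standing in for a downloaded page.
SAMPLE_HTML = """
<div class="item">
  <span class="title">Movie A</span>
  <span class="title"> / Original Title A</span>
  <span class="other">ignored</span>
</div>
<div class="item">
  <span class="title">Movie B</span>
</div>
"""


class TitleParser(HTMLParser):
    """Collect the text of every <span class="title"> element."""

    def __init__(self):
        super().__init__()
        self.in_title = False
        self.titles = []

    def handle_starttag(self, tag, attrs):
        if tag == "span" and dict(attrs).get("class") == "title":
            self.in_title = True
            self.titles.append("")

    def handle_data(self, data):
        # Append text only while inside a matching <span>.
        if self.in_title:
            self.titles[-1] += data

    def handle_endtag(self, tag):
        if tag == "span":
            self.in_title = False


parser = TitleParser()
parser.feed(SAMPLE_HTML)
parser.close()

# Keep only the primary titles, i.e. those without a "/" separator.
primary_titles = [t for t in parser.titles if "/" not in t]
print(primary_titles)  # ['Movie A', 'Movie B']
```

A library such as BeautifulSoup replaces the hand-written parser class with a single find_all call, which is why it is the more common choice in practice.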

Example

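# Imports
import requests
from bs4 import BeautifulSoup

# Pretend to be a browser by sending a browser-like User-Agent header
head = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.122 Safari/537.36"
}
# First page only:
# response = requests.get("https://movie.douban.com/top250?start=0", headers=head)
# Loop over every page: each page lists 25 movies, so a step of 25 covers all 10 pages
for start_num in range(0, 250, 25):
    response = requests.get(f"https://movie.douban.com/top250?start={start_num}", headers=head)
    html = response.text  # the HTML returned for this page

    soup = BeautifulSoup(html, "html.parser")

    all_title = soup.find_all("span", attrs={"class": "title"})

    for title in all_title:
        # print(title)  # uncomment to inspect the whole tag
        # print(title.string)
        title_string = title.string
        # .string is None unless the tag holds exactly one string, so guard
        # against that; titles containing "/" are alternate titles and are skipped
        if title_string and "/" not in title_string:
            print(title_string)  # a primary title from the current page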
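
The example above only prints the titles, so step 3 (storing the data) is left open. One possible way to do it, sketched here with the standard-library csv module, is to write each title with its rank to a CSV file. The save_titles helper, the column names, and the sample titles are all hypothetical and not part of the original example.

```python
import csv
import io


def save_titles(titles, file_obj):
    """Write one ranked title per row to an open text file object."""
    writer = csv.writer(file_obj)
    writer.writerow(["rank", "title"])
    for rank, title in enumerate(titles, start=1):
        writer.writerow([rank, title])


# Hypothetical titles standing in for the scraped results.
titles = ["Movie A", "Movie B"]

# An in-memory buffer keeps the sketch self-contained; for a real file use
# open("titles.csv", "w", newline="", encoding="utf-8") instead.
buffer = io.StringIO()
save_titles(titles, buffer)
print(buffer.getvalue())
```

To use this with the crawler, collect the titles in a list inside the loop instead of printing them, then call save_titles once after the last page has been processed.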