Python Web Scraping (Part 2) (Complete)
Continued from the previous installment:
Python Web Scraping (Part 1)
V. The selenium Library
1. Introduction to selenium
1.1 What is selenium?
Selenium is a tool for testing web applications.
Selenium tests run directly in the browser, just as a real user would operate it.
It can drive real browsers through various drivers (FirefoxDriver, InternetExplorerDriver, OperaDriver, ChromeDriver)
to complete the tests.
selenium also supports headless (no-UI) browser operation.
1.2 Why use selenium?
It simulates a real browser and automatically executes the JavaScript on a page, so dynamically loaded content can be scraped.
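The difference is easy to see with a minimal sketch (the URL and the chromedriver path below are placeholders, not part of the original examples): urllib only returns the raw HTML the server sends, while selenium returns the DOM after the page's JavaScript has run.
from selenium import webdriver
import urllib.request

url = 'https://example.com/'  # placeholder: any page that fills in content with JS

# static HTML exactly as the server sent it (no JS executed)
static_html = urllib.request.urlopen(url).read().decode('utf-8')

# HTML after the browser has executed the page's JS
browser = webdriver.Chrome('chromedriver.exe')  # assumed driver path
browser.get(url)
rendered_html = browser.page_source
browser.quit()

# on a JS-heavy page the rendered source is usually much longer
print(len(static_html), len(rendered_html))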
2. Installing and using selenium
ChromeDriver download page
A mapping table between ChromeDriver versions and Chrome versions is provided there
Check your Chrome version:
Chrome top-right menu --> Help --> About Google Chrome
Install selenium
Run pip install selenium under /python/Scripts/
Import the selenium package
from selenium import webdriver
Create a browser object
path = 'path to chromedriver.exe'
browser = webdriver.Chrome(path)
Visit a website
url = ''
browser.get(url)
3. Locating Elements
3.1 Get an element by the value of a tag attribute
3.1.1 By id
from selenium import webdriver

# path to chromedriver.exe
path = 'chromedriver.exe'
browser = webdriver.Chrome(path)

url = ''
browser.get(url=url)

button = browser.find_element(by='id', value='su')
print(button)
# <selenium.webdriver.remote.webelement.WebElement (session="7bf780ef3a7a01665ebc7dd63bb4309b", element="d6d60e6e-f7a0-4170-9360-d9eea5c4c4a4")>
3.1.2 By XPath expression
button = browser.find_element(by='xpath',value='//input[@id="su"]')
print(button)
# <selenium.webdriver.remote.webelement.WebElement (session="7bf780ef3a7a01665ebc7dd63bb4309b", element="d6d60e6e-f7a0-4170-9360-d9eea5c4c4a4")>
3.1.3 By tag name
input = browser.find_element(by='tag name',value='input')
print(input)
# <selenium.webdriver.remote.webelement.WebElement (session="7bf780ef3a7a01665ebc7dd63bb4309b", element="a777d1fb-8802-41f0-a39b-2e1e1ee1b329")>
3.1.4 By CSS selector (the same syntax as bs4)
button = browser.find_element(by='css selector',value='#su')
print(button)
# <selenium.webdriver.remote.webelement.WebElement (session="7bf780ef3a7a01665ebc7dd63bb4309b", element="d6d60e6e-f7a0-4170-9360-d9eea5c4c4a4")>
3.1.5 By the text of an <a> tag (link text)
button = browser.find_element(by='link text',value='新闻')
print(button)
# <selenium.webdriver.remote.webelement.WebElement (session="7bf780ef3a7a01665ebc7dd63bb4309b", element="548094f1-bf3d-412a-9d03-40ee6032fcef")>
4. Accessing Element Information
4.1 Get the value of the class attribute of the element whose id is su
from selenium import webdriver

path = 'chromedriver.exe'
browser = webdriver.Chrome(path)

url = ''
browser.get(url=url)

input = browser.find_element(by='id', value='su')
print(input.get_attribute('class'))
# bg s_btn
4.2 Get the tag name
print(input.tag_name)
# input
4.3 Get the element's text
button = browser.find_element(by='link text',value='新闻')
print(button.text)
# 新闻
5. Interaction
5.1 Type text into a text box
from selenium import webdriver
import time

path = 'chromedriver.exe'
browser = webdriver.Chrome(path)

url = ''
browser.get(url=url)

# get the text box
input_txt = browser.find_element(by='id', value='kw')
# type 周杰伦
input_txt.send_keys('周杰伦')
# sleep for 2 seconds
time.sleep(2)
5.2 Click a button
# get the button
button_baidu = browser.find_element(by='id', value='su')
# click the button
button_baidu.click()
5.3 Scroll to the bottom of the page
# done by executing JS code
js_bottom = 'document.documentElement.scrollTop=100000'
browser.execute_script(js_bottom)
5.4 Go back to the previous page
browser.back()
5.5 Go forward again
browser.forward()
5.6 Clear the input box
input_txt.clear()
5.7 Quit
browser.quit()
5.8 Exercise: automate Baidu with selenium
The script performs:
Open Baidu -> search for 周杰伦 -> scroll to the bottom -> click "next page" twice -> go back -> go forward -> clear the input box -> quit
from selenium import webdriver
import time

path = 'chromedriver.exe'
browser = webdriver.Chrome(path)

url = ''
browser.get(url=url)
# sleep for 2 seconds
time.sleep(2)

# get the text box
input_txt = browser.find_element(by='id', value='kw')
# type 周杰伦
input_txt.send_keys('周杰伦')
time.sleep(2)

# get the "百度一下" button
button_baidu = browser.find_element(by='id', value='su')
# click the button
button_baidu.click()
time.sleep(2)

# scroll to the bottom
js_bottom = 'document.documentElement.scrollTop=100000'
browser.execute_script(js_bottom)
time.sleep(2)

# get the "next page" button
button_next = browser.find_element(by='xpath', value='//a[@class="n"]')
# click next page
button_next.click()
time.sleep(2)
# click next page again
button_next.click()
time.sleep(2)

# go back to the previous page
browser.back()
time.sleep(2)
# go forward again
browser.forward()
time.sleep(2)

# clear the input box
input_txt.clear()
time.sleep(3)

# quit
browser.quit()
6. Chrome headless: a browser without a UI
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.chrome.service import Service

def share_browser():
    path = Service(r'chromedriver.exe')
    options = Options()
    options.add_argument('--headless')
    options.add_argument('--disable-gpu')
    browser = webdriver.Chrome(options=options, service=path)
    return browser

browser = share_browser()
url = ''
browser.get(url)
browser.save_screenshot('baidu.png')
VI. The requests Library
Install requests: run pip install requests in the /python/Scripts directory
1. Basic usage: one type and six attributes
1.1 One type
import requests
url = ''
response = requests.get(url=url)
print(type(response))
# <class 'requests.models.Response'>
# the Response type
1.2 Six attributes
1.2.1 Set the response encoding
response.encoding = 'utf-8'
1.2.2 Return the page source as a string
print(response.text)
1.2.3 Return the URL of the response
print(response.url)
#
1.2.4 Return the page source as binary data
print(response.content)
1.2.5 Return the response status code
print(response.status_code)
# 200
1.2.6 Return the response headers
print(response.headers)
# {'Cache-Control': 'private, no-cache, no-store, proxy-revalidate, no-transform', 'Connection': 'keep-alive', 'Content-Encoding': 'gzip', 'Content-Type': 'text/html', 'Date': 'Mon, 20 Mar 2023 12:51:11 GMT', 'Last-Modified': 'Mon, 23 Jan 2017 13:27:56 GMT', 'Pragma': 'no-cache', 'Server': 'bfe/1.0.8.18', 'Set-Cookie': 'BDORZ=27315; max-age=86400; domain=.baidu; path=/', 'Transfer-Encoding': 'chunked'}
GET requests
Use a GET request to fetch a Baidu search results page
import requestsurl = '?'headers = {'Cookie':'BIDUPSID=BC221A7A6E195D713FF461A23C0C6C03; PSTM=1660552946; BDUSS=h2dzlpZUJuWTNjM25LV1ctZHpKQXVjREhGc2lYb2VxZFV6Y2RvZ3lvNTNybFJqRVFBQUFBJCQAAAAAAAAAAAEAAAALqAno1rvKo7vY0uR3AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAHchLWN3IS1jdE; BDUSS_BFESS=h2dzlpZUJuWTNjM25LV1ctZHpKQXVjREhGc2lYb2VxZFV6Y2RvZ3lvNTNybFJqRVFBQUFBJCQAAAAAAAAAAAEAAAALqAno1rvKo7vY0uR3AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAHchLWN3IS1jdE; BAIDUID=009B581D121BAB0A028000A422796207:SL=0:NR=10:FG=1; BD_UPN=12314753; BAIDUID_BFESS=009B581D121BAB0A028000A422796207:SL=0:NR=10:FG=1; ZFY=lj:AKw8fsMYoAmQP38mmIPNebgbXLMwDG0N9ODwS5quI:C; sug=3; sugstore=0; ORIGIN=2; bdime=0; BD_HOME=1; BDRCVFR[feWj1Vr5u3D]=I67x6TjHwwYf0; BD_CK_SAM=1; PSINO=7; delPer=0; BDORZ=B490B5EBF6F3CD402E515D22BCDA1598; BAIDU_WISE_UID=wapp_1679361289881_69; __bid_n=186718ae070167b4ec4207; FPTOKEN=HX9cJIAC1OsbehNXqos0lPZ/kLt5B6mSgi7zWXntOJnl4A6kF+y3jGiOQawhEPMcs0TweODc71y7H75PgfY2tyb6Y7AkqPGxfww+2VM3N0z0zh9sw82VrvG39nWSW2EcLh/vXTruFA3LOas/iF5Q8S/m5GlcL1iM6R8etzBqT2Ys+FalyyytWGJ5b8rjXk6DhUoPoUkxJMce9V2EjezsO+t2k+dE+LBWZRrHar8xANB/0VHMEICLaOOEueHnlgCOefbn0QNWyxeyZx9/8gSnwW0lBDCWx/APlOn9pFHsbmPHMg86HlOOIyivCnnJobN5XyCFkC3I2WMu3DHPFOBUGPNO8ayjKswvhDj9J84plb+A+hmwofCUzoQnNCZlMuFa3gH6hSPkYw5TjTnpcH/SoQ==|29Qf0FbXYRIMwpn44o4mB5+0O6zasymNL74pMZL59jw=|10|ae92cbfed415aa2e64c3be6f636f8677; arialoadData=false; COOKIE_SESSION=5_0_8_9_3_21_1_2_8_8_1_3_427489_0_21_0_1679296227_0_1679296206%7C9%23276178_339_1678349345%7C9; shifen[797184_97197]=1679361811; shifen[1720973_97197]=1679361815; BCLID=11675428486066043407; BCLID_BFESS=11675428486066043407; BDSFRCVID=Km4OJeCT5G09-drfyDIHuUzbzOVXzVJTTPjcTR5qJ04BtyCVcmimEG0PtOg3cu-M_EGSogKKy2OTH90F_2uxOjjg8UtVJeC6EG0Ptf8g0M5; BDSFRCVID_BFESS=Km4OJeCT5G09-drfyDIHuUzbzOVXzVJTTPjcTR5qJ04BtyCVcmimEG0PtOg3cu-M_EGSogKKy2OTH90F_2uxOjjg8UtVJeC6EG0Ptf8g0M5; H_BDCLCKID_SF=tJAO_C82tIP3fP36q45HMt00qxby26nBMgn9aJ5nQI5nhbvb3fnt2f3LbpoPhtjEBacvXfQFQUbmjRO206oay6O3LlO83h5M55rbKl0MLPbceIOn5DcDjJL73xnMBMn8teOnaIIM3fAKftnOM46JehL3346-35543bRTLnLy5KJYMDFRjj8KjjbBDHRf-b-XKD600PK8Kb7Vbp5gqfnkbft7jttjqCrb-Dc8KIjIbx50H4cO3l703xI73b3B5h3NJ66ZoIbPbPTTSROzMtcpQT8r5-nMJ-JtHmjdKl3Rab3vOPI4XpO1ej8zBN5thURB2DkO-4bCWJ5TMl5jDh3Mb6ksD-FtqjDefnIe_Ityf-3bfTrP-trf5DCShUFs5CuOB2Q-5M-a3KJOVU5ObRrUyJ_03q6ahfRpyTQpafbmLncjSM_GKfC2jMD32tbpK-bH5gTxoUJ2Bb05HPKz-6oh3hKebPRih6j9Qg-8opQ7tt5W8ncFbT7l5hKpbt-q0x-jLTnhVn0MBCK0hIKmD6_bj6oM5pJfetjK2CntsJOOaCvvDDbOy4oT35L1DauLKnjhMmnAaP5GMJo08UQy2ljk3h0rMxbnQjQDWJ4J5tbX0MQjDJTzQft20b0gbNb2-CruX2Txbb7jWhvBhl72y5u2QlRX5q79atTMfNTJ-qcH0KQpsIJM5-DWbT8IjH62btt_JJueoKnP; H_BDCLCKID_SF_BFESS=tJAO_C82tIP3fP36q45HMt00qxby26nBMgn9aJ5nQI5nhbvb3fnt2f3LbpoPhtjEBacvXfQFQUbmjRO206oay6O3LlO83h5M55rbKl0MLPbceIOn5DcDjJL73xnMBMn8teOnaIIM3fAKftnOM46JehL3346-35543bRTLnLy5KJYMDFRjj8KjjbBDHRf-b-XKD600PK8Kb7Vbp5gqfnkbft7jttjqCrb-Dc8KIjIbx50H4cO3l703xI73b3B5h3NJ66ZoIbPbPTTSROzMtcpQT8r5-nMJ-JtHmjdKl3Rab3vOPI4XpO1ej8zBN5thURB2DkO-4bCWJ5TMl5jDh3Mb6ksD-FtqjDefnIe_Ityf-3bfTrP-trf5DCShUFs5CuOB2Q-5M-a3KJOVU5ObRrUyJ_03q6ahfRpyTQpafbmLncjSM_GKfC2jMD32tbpK-bH5gTxoUJ2Bb05HPKz-6oh3hKebPRih6j9Qg-8opQ7tt5W8ncFbT7l5hKpbt-q0x-jLTnhVn0MBCK0hIKmD6_bj6oM5pJfetjK2CntsJOOaCvvDDbOy4oT35L1DauLKnjhMmnAaP5GMJo08UQy2ljk3h0rMxbnQjQDWJ4J5tbX0MQjDJTzQft20b0gbNb2-CruX2Txbb7jWhvBhl72y5u2QlRX5q79atTMfNTJ-qcH0KQpsIJM5-DWbT8IjH62btt_JJueoKnP; 
RT="z=1&dm=baidu&si=13a1c330-42ed-4127-b6f2-f24a8d4e32ad&ss=lfhkckh4&sl=g&tt=1ixr&bcn=https%3A%2F%2Ffclog.baidu%2Flog%2Fweirwood%3Ftype%3Dperf&ld=bjd8&ul=bu3x&hd=bu4e"; BA_HECTOR=2g8g2k8k04ala08l200l24bc1i1ikvd1n; H_PS_PSSID=38185_36550_38354_38366_37862_38170_38289_38246_36804_38261_37937_38312_38382_38285_38041_26350_37958_22159_38282_37881; H_PS_645EC=e73839RPUo9xfSdEWH%2BgZ2bKX7tRqQHy3xBHs9JSckkQDx6S5tfVDzVkc1R94pTfrrr5; BDSVRTM=397; baikeVisitId=0486a65c-3d0d-4c44-9543-410350269b72','User-Agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/111.0.0.0 Safari/537.36'
}
data = {
    'wd': '北京'
}
# url: the request URL
# params: the query-string parameters
# kwargs: a dict of extra options
response = requests.get(url=url, params=data, headers=headers)

content = response.text
with open('baidu.html', 'w', encoding='utf-8') as fp:
    fp.write(content)
POST requests
Use a POST request to scrape Baidu Translate results
import requestsurl = '=en&to=zh'headers = {'Cookie':'BIDUPSID=BC221A7A6E195D713FF461A23C0C6C03; PSTM=1660552946; REALTIME_TRANS_SWITCH=1; FANYI_WORD_SWITCH=1; HISTORY_SWITCH=1; SOUND_SPD_SWITCH=1; SOUND_PREFER_SWITCH=1; BDUSS=h2dzlpZUJuWTNjM25LV1ctZHpKQXVjREhGc2lYb2VxZFV6Y2RvZ3lvNTNybFJqRVFBQUFBJCQAAAAAAAAAAAEAAAALqAno1rvKo7vY0uR3AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAHchLWN3IS1jdE; BDUSS_BFESS=h2dzlpZUJuWTNjM25LV1ctZHpKQXVjREhGc2lYb2VxZFV6Y2RvZ3lvNTNybFJqRVFBQUFBJCQAAAAAAAAAAAEAAAALqAno1rvKo7vY0uR3AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAHchLWN3IS1jdE; BAIDUID=009B581D121BAB0A028000A422796207:SL=0:NR=10:FG=1; APPGUIDE_10_0_2=1; Hm_lvt_64ecd82404c51e03dc91cb9e8c025574=1677725867,1679036239,1679280393; BDORZ=B490B5EBF6F3CD402E515D22BCDA1598; BA_HECTOR=8g0h040g0h8l0404250124cu1i1it081n; BDRCVFR[feWj1Vr5u3D]=I67x6TjHwwYf0; delPer=0; PSINO=7; BAIDUID_BFESS=009B581D121BAB0A028000A422796207:SL=0:NR=10:FG=1; ZFY=lj:AKw8fsMYoAmQP38mmIPNebgbXLMwDG0N9ODwS5quI:C; H_PS_PSSID=38185_36550_38354_38366_37862_38170_38289_38246_36804_38261_37937_38312_38382_38285_38041_26350_37958_22159_38282_37881; BDSFRCVID=JJ-OJexroG07VWbfyqX-uUzbz_weG7bTDYrEOwXPsp3LGJLVFe3JEG0Pts1-dEu-S2OOogKKy2OTH90F_2uxOjjg8UtVJeC6EG0Ptf8g0M5; BDSFRCVID_BFESS=JJ-OJexroG07VWbfyqX-uUzbz_weG7bTDYrEOwXPsp3LGJLVFe3JEG0Pts1-dEu-S2OOogKKy2OTH90F_2uxOjjg8UtVJeC6EG0Ptf8g0M5; BCLID=8097868108805834232; BCLID_BFESS=8097868108805834232; H_BDCLCKID_SF=tbIJoDK5JDD3fP36q45HMt00qxby26n45Pj9aJ5nQI5nhKIzbb5t2f3LQloPhtjEBacvXfQFQUbmjRO206oay6O3LlO83h52aC5LKl0MLPbceIOn5DcYBUL10UnMBMn8teOnaIIM3fAKftnOM46JehL3346-35543bRTLnLy5KJYMDF4DTAhDT3QeaRf-b-XKD600PK8Kb7VbnDzeMnkbft7jttjahJPBKc8KpThbx50H4cO3l703xI73b3B5h3NJ66ZoIbPbPTTSR3TK4OpQT8r5-nMJ-JtHCutLxbCab3vOPI4XpO1ej8zBN5thURB2DkO-4bCWJ5TMl5jDh3Mb6ksD-FtqjttJnut_KLhf-3bfTrP-trf5DCShUFsB-uJB2Q-5M-a3KtBKJb4bRrUyfk03q6ahfRpyTQpafbmLncjSM_GKfC2jMD32tbp5-r0amTxoUJ2Bb05HPKzXqnpQptebPRih6j9Qg-8opQ7tt5W8ncFbT7l5hKpbt-q0x-jLTnhVn0MBCK0hD89DjKKD6PVKgTa54cbb4o2WbCQXMOm8pcN2b5oQTOW3RJaKJcDM6Qt-4ctMf7beq06-lOUWJDkXpJvQnJjt2JxaqRC3JK5Ol5jDh3MKToDb-oteltHB2Oy0hvcMCocShPwDMjrDRLbXU6BK5vPbNcZ0l8K3l02V-bIe-t2XjQhDH-OJ6DHtJ3aQ5rtKRTffjrnhPF3yxTDXP6-hnjy3b4f-f3t5tTao56G3J6D2l4Wbttf5q3Ry6r42-39LPO2hpRjyxv4Q4Qyy4oxJpOJ-bCL0p5aHx8K8p7vbURv2jDg3-A8JU5dtjTO2bc_5KnlfMQ_bf--QfbQ0hOhqP-jBRIE3-oJqC_-MKt93D; H_BDCLCKID_SF_BFESS=tbIJoDK5JDD3fP36q45HMt00qxby26n45Pj9aJ5nQI5nhKIzbb5t2f3LQloPhtjEBacvXfQFQUbmjRO206oay6O3LlO83h52aC5LKl0MLPbceIOn5DcYBUL10UnMBMn8teOnaIIM3fAKftnOM46JehL3346-35543bRTLnLy5KJYMDF4DTAhDT3QeaRf-b-XKD600PK8Kb7VbnDzeMnkbft7jttjahJPBKc8KpThbx50H4cO3l703xI73b3B5h3NJ66ZoIbPbPTTSR3TK4OpQT8r5-nMJ-JtHCutLxbCab3vOPI4XpO1ej8zBN5thURB2DkO-4bCWJ5TMl5jDh3Mb6ksD-FtqjttJnut_KLhf-3bfTrP-trf5DCShUFsB-uJB2Q-5M-a3KtBKJb4bRrUyfk03q6ahfRpyTQpafbmLncjSM_GKfC2jMD32tbp5-r0amTxoUJ2Bb05HPKzXqnpQptebPRih6j9Qg-8opQ7tt5W8ncFbT7l5hKpbt-q0x-jLTnhVn0MBCK0hD89DjKKD6PVKgTa54cbb4o2WbCQXMOm8pcN2b5oQTOW3RJaKJcDM6Qt-4ctMf7beq06-lOUWJDkXpJvQnJjt2JxaqRC3JK5Ol5jDh3MKToDb-oteltHB2Oy0hvcMCocShPwDMjrDRLbXU6BK5vPbNcZ0l8K3l02V-bIe-t2XjQhDH-OJ6DHtJ3aQ5rtKRTffjrnhPF3yxTDXP6-hnjy3b4f-f3t5tTao56G3J6D2l4Wbttf5q3Ry6r42-39LPO2hpRjyxv4Q4Qyy4oxJpOJ-bCL0p5aHx8K8p7vbURv2jDg3-A8JU5dtjTO2bc_5KnlfMQ_bf--QfbQ0hOhqP-jBRIE3-oJqC_-MKt93D; ab_sr=1.0.1_MTVjY2NjMmVjZWI3NDkwMTQwY2IyNjhlODUzODZmMWVmZTY5YTUyNzQyYTkxNjFkZWJkODc4ZGM4YjQ4YjRhYTE1NDQ1ZmQ2NDdjMmVlNGRhMmQ4NzMwZTg0YjYwNGRmMzI4NGUyY2RiYjc2MjgwODM2ZTZiMDEzYmRjNDM2ZjY2NDljYzI4ZWE4OGRhYThiYzc2M2YzNmYzNWYyMzE3ZDZkNjUyMzIwMDE4YjI2ZTI0MTg1ODg5M2ZiZDAzOTFi; 
Hm_lpvt_64ecd82404c51e03dc91cb9e8c025574=1679391374','User-Agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/111.0.0.0 Safari/537.36',
}
data = {
    'from': 'en',
    'to': 'zh',
    'query': 'lover',
    'simple_means_flag': '3',
    'sign': '779242.983259',
    'token': 'fc3e9c7437f19e66e7c851920a776f05',
    'domain': 'common',
}
# url: the request URL
# data: the form parameters
# kwargs: a dict of extra options
response = requests.post(url=url, data=data, headers=headers)
response.encoding = 'utf-8'

content = response.text

import json
obj = json.loads(content)
print(obj)
Notes:
(1) A POST request does not need the parameters to be URL-encoded or decoded.
(2) The parameters of a POST request are passed through data.
(3) No customized Request object needs to be built (unlike with urllib).
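A minimal sketch of those three points, using https://httpbin.org/post as a stand-in endpoint that simply echoes the form back (it is not part of the original example):
import requests

url = 'https://httpbin.org/post'  # stand-in endpoint for illustration
data = {'from': 'en', 'to': 'zh', 'query': 'lover'}
headers = {'User-Agent': 'Mozilla/5.0'}

# the dict goes straight into data: no urlencode step and no Request object
response = requests.post(url=url, data=data, headers=headers)
print(response.json()['form'])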
Cookie login to the gushiwen (classical Chinese poetry) site
Submit a deliberately wrong login to find the login endpoint and the parameters it needs:
__VIEWSTATE: Ex1PVx+h6L9R6etKhj960tstXW3+ejGnUvV/SODQ04iUB5sJ15hNVPO6ZP33bPkhe2LPrNLTKSY+pPgxVCNSXIg9xyl0lpOo1XaUvoCoXtZaRnKctEI+vbbrfaYLQ+RRKmkm8NK5exHqSzhok/cQB/Ch5wQ=
__VIEWSTATEGENERATOR: C93BE1AE
from: .aspx
email: 19909602290
pwd: wu111111
code: M4j7
denglu: 登录
Looking at the parameters shows that __VIEWSTATE, __VIEWSTATEGENERATOR and code change between requests.
Difficulties:
(1) The unknown values __VIEWSTATE and __VIEWSTATEGENERATOR
Solution:
Values you cannot see on the page are usually hidden in the page source, so fetch the source and read the hidden values from it.
(2) The captcha code
Solution:
Download the captcha image --> read it by eye (or with image recognition) to get the code.
import requests

# URL of the login page
url = '.aspx?from=.aspx'
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/111.0.0.0 Safari/537.36',
}
response = requests.get(url=url, headers=headers)
# get the page source
content = response.text

# parse the page source to get __VIEWSTATE and __VIEWSTATEGENERATOR
from bs4 import BeautifulSoup
soup = BeautifulSoup(content, 'lxml')
# get __VIEWSTATE
viewstate = soup.select('#__VIEWSTATE')[0].attrs.get('value')
# get __VIEWSTATEGENERATOR
viewstategenerator = soup.select('#__VIEWSTATEGENERATOR')[0].attrs.get('value')

# get the captcha image URL
codeimg = soup.select('#imgCode')[0].attrs.get('src')
codeimgUrl = '' + codeimg
# after downloading the captcha image, look at it and type the code into the console

# import urllib.request
# urllib.request.urlretrieve(url=codeimgUrl, filename='requests库\code.jpg')
# Using urllib would request the captcha once while downloading the image; the login request below
# would then generate a new captcha, so the code we read would already be invalid.
# Conclusion: urllib.request.urlretrieve cannot be used here.

# requests provides session(); all requests made through the returned session object share the same cookies
session = requests.session()
# fetch the captcha image through the session
response_code = session.get(codeimgUrl)
# note: use the binary content here, because we are downloading an image and images are binary data
content_code = response_code.content
# the 'wb' mode writes binary data to the file
with open('requests库\code.jpg', 'wb') as fp:
    fp.write(content_code)

# read the captcha with your own eyes
code_name = input('请输入你的验证码:')

# URL that the login button posts to
url_post = '.aspx?from=http%3a%2f%2fso.gushiwen%2fuser%2fcollect.aspx'
data_post = {
    '__VIEWSTATE': viewstate,
    '__VIEWSTATEGENERATOR': viewstategenerator,
    'from': 'http://so.gushiwen/user/collect.aspx',
    'email': '19909602290',
    'pwd': 'wu101042',
    'code': code_name,
    'denglu': '登录',
}
response_post = session.post(url=url_post, headers=headers, data=data_post)

content_post = response_post.text
with open('requests库\gushiwen.html', 'w', encoding='utf-8') as fp:
    fp.write(content_post)
Machine recognition of captcha images with the Chaojiying (超级鹰) platform
Reference page:
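On such a platform the manual input() step above is replaced with a call to the platform's recognition client. The sketch below only shows where that call plugs in; recognize_captcha is a hypothetical helper standing in for the platform's own SDK, not a real API (see the platform's documentation for its actual client).
# hypothetical helper: stands in for the captcha platform's client/SDK
def recognize_captcha(image_bytes):
    # here you would send image_bytes to the recognition service and return the text it reports
    raise NotImplementedError

response_code = session.get(codeimgUrl)
content_code = response_code.content
code_name = recognize_captcha(content_code)  # replaces: code_name = input('请输入你的验证码:')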
VII. The scrapy Framework
Definition: Scrapy is an application framework written to crawl websites and extract structured data.
It can be used in a range of programs, including data mining, information processing, and storing historical data.
1. Installing scrapy
Run pip install scrapy in the /python/Scripts directory
A common point of failure:
Error: building 'twisted.test.raiser' extension
Fix:
Download the Twisted package matching the error from that page
cp: the Python version
win: the Windows version
2. Creating and running a scrapy project
2.1 Create a scraping project
scrapy startproject <project name>
Note: the project name must not start with a digit and must not contain Chinese characters.
2.2 Create a spider file
Open cmd in the project folder -> spiders directory and create the spider file there.
Note the command for creating and naming the spider file:
scrapy genspider <spider file name> <page to crawl>
e.g. scrapy genspider baidu
2.3 Run the spider
scrapy crawl <spider name (name)>
Running it directly does not work, because scrapy obeys the robots protocol by default and therefore cannot crawl some pages.
Fix:
1. In ..\爬虫\scrapy_baidu\scrapy_baidu\settings.py, comment out ROBOTSTXT_OBEY = True so the robots protocol is no longer obeyed.
2. Run scrapy crawl <spider name (name)>
3. Structure of a scrapy project
project name
    project name
        spiders folder (the spider files live here)
            __init__.py
        __init__.py
        items.py (defines the data structure: which fields the scraped data contains)
        middlewares.py (middleware, e.g. proxies)
        settings.py (configuration: robots protocol, UA definition, and so on)
4. Attributes and methods of response
Get the data as a string
content = response.text
Get the data as binary data
content = response.body
Parse the content of the response with XPath
response.xpath()
Extract the data value of the selector objects
response.xpath().extract()
Extract the first item of the selector list
response.xpath().extract_first()
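Put together inside a spider, these calls look roughly like this (a minimal sketch; the spider name and XPath expression are placeholders, not taken from a specific site):
import scrapy

class DemoSpider(scrapy.Spider):
    name = "demo"
    start_urls = []  # fill in the page(s) to crawl

    def parse(self, response):
        text = response.text                      # string data
        raw = response.body                       # binary data
        titles = response.xpath('//h2/a/text()')  # a list of selector objects
        first = titles.extract_first()            # data value of the first selector
        all_titles = titles.extract()             # data values of all the selectors
        print(len(text), len(raw), first, all_titles)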
5. Exercise: download car information from Autohome (汽车之家)
# create the project
scrapy startproject scrapy_qczj
# create the spider file
scrapy genspider bm .html
# run the spider
scrapy crawl bm
bm.py contains the following:
import scrapy

class BmSpider(scrapy.Spider):
    name = "bm"
    allowed_domains = [".html"]
    # note: when the page ends with .html, the URL must not end with a trailing /
    start_urls = [".html"]

    def parse(self, response):
        bm_imgUrl = response.xpath('//div[@class="list-cont-img"]//img/@src')
        bm_name = response.xpath('//div[@class="list-cont-main"]/div[@class="main-title"]/a/text()')
        bm_price = response.xpath('//div[@class="list-cont-main"]/div[@class="main-lever"]//span[@class="font-arial"]/text()')
        print('=================')
        # print(bm_price.extract())
        # print(len(bm_imgUrl))
        for i in range(len(bm_imgUrl)):
            url = 'http:' + bm_imgUrl[i].extract()
            name = bm_name[i].extract()
            price = bm_price[i].extract()
            with open('D:/study/Python/爬虫/scrapy框架/qczj/qczj/spiders/cars.txt', 'a', encoding='utf-8') as fp:
                fp.write('网址:' + url + '\n' + '系列:' + name + '\n' + '价格:' + price + '\n\n')
6. yield: multi-pipeline, multi-page download of Dangdang data
6.1 What each file is for
pipelines.py is used to save the downloaded data; multiple download pipelines can be defined
items.py defines the data structure of the data to be downloaded
6.2 Define the data structure and scrape the data
1. Create the project
scrapy startproject dangdang
2. Create the spider file
Run in the spiders directory:
scrapy genspider dang .html
3. Write the files
items.py:
import scrapy

class DangdangItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    # data structure of the data to download
    # image
    src = scrapy.Field()
    # title
    name = scrapy.Field()
    # price
    price = scrapy.Field()
dang.py:
Note: every selector object can call the xpath method again
import scrapy

class DangSpider(scrapy.Spider):
    name = "dang"
    allowed_domains = [".html"]
    start_urls = [".html"]

    def parse(self, response):
        print('=====================')
        # every selector object can call the xpath method again
        bookcover = response.xpath('//div[@id="book_list"]//span[@class="bookcover"]')
        bookinfo = response.xpath('//div[@id="book_list"]//div[@class="bookinfo"]')
        for i in range(len(bookcover)):
            src = bookcover.xpath('./img[not(@class="promotion_label")]/@src')[i].get()
            name = bookinfo.xpath('./div[@class="title"]/text()')[i].get()
            price = bookinfo.xpath('./div[@class="price"]/span/text()')[i].get()
            print(src + '\n' + name + '\n' + price + '\n')
6.3 Multi-pipeline encapsulation
dang.py:
import scrapy
# import the item class defined in items.py
from dangdang.items import DangdangItem

class DangSpider(scrapy.Spider):
    name = "dang"
    allowed_domains = [".html"]
    start_urls = [".html"]

    def parse(self, response):
        print('=====================')
        # every selector object can call the xpath method again
        bookcover = response.xpath('//div[@id="book_list"]//span[@class="bookcover"]')
        bookinfo = response.xpath('//div[@id="book_list"]//div[@class="bookinfo"]')
        for i in range(len(bookcover)):
            src = bookcover.xpath('./img[not(@class="promotion_label")]/@src')[i].get()
            name = bookinfo.xpath('./div[@class="title"]/text()')[i].get()
            price = bookinfo.xpath('./div[@class="price"]/span/text()')[i].get()
            # build an item with the data structure defined in items.py
            book = DangdangItem(src=src, name=name, price=price)
            # hand each book over to the pipelines
            yield book
4. Write the pipeline file
To use a pipeline, it must be enabled in settings.
settings.py:
# uncomment this block
ITEM_PIPELINES = {
    # pipelines have a priority in the range 1 to 1000; the smaller the value, the higher the priority
    "dangdang.pipelines.DangdangPipeline": 300,
    "dangdang.pipelines.DangdangImgPipeline": 301,
}
pipelines.py:
from itemadapter import ItemAdapter
import urllib.request

# to use a pipeline, it must be enabled in settings

# save the book information
class DangdangPipeline:
    # open the file once when the spider starts
    def open_spider(self, spider):
        self.fp = open('D:/study/Python/爬虫/scrapy框架/dangdang/dangdang/spiders/book.json', 'w', encoding='utf-8')

    # item is the book object yielded in dang.py
    def process_item(self, item, spider):
        self.fp.write(str(item))
        print('************')
        return item

    # close the file when the spider finishes
    def close_spider(self, spider):
        self.fp.close()

# download the cover images with urlretrieve
class DangdangImgPipeline:
    def process_item(self, item, spider):
        url = item.get('src')
        filename = 'D:/study/Python/爬虫/scrapy框架/dangdang/dangdang/BookImg/' + item.get('name') + '.jpg'
        urllib.request.urlretrieve(url=url, filename=filename)
        return item
6.4 Downloading multiple pages
dang.py:
import scrapy
from dangdang.items import DangdangItem

class DangSpider(scrapy.Spider):
    name = "dang"
    # for multi-page downloads, allowed_domains must be widened; usually only the domain is listed
    allowed_domains = ["e.dangdang"]
    start_urls = [".html"]
    base_url = ''
    page = 1

    def parse(self, response):
        # pipelines: save the data
        # items: define the data structure
        print('=====================')
        # every selector object can call the xpath method again
        bookcover = response.xpath('//div[@id="book_list"]//span[@class="bookcover"]')
        bookinfo = response.xpath('//div[@id="book_list"]//div[@class="bookinfo"]')
        for i in range(len(bookcover)):
            src = bookcover.xpath('./img[not(@class="promotion_label")]/@src')[i].get()
            name = bookinfo.xpath('./div[@class="title"]/text()')[i].get()
            price = bookinfo.xpath('./div[@class="price"]/span/text()')[i].get()
            # print(src + '\n' + name + '\n' + price + '\n')
            book = DangdangItem(src=src, name=name, price=price)
            # hand each book over to the pipelines
            yield book

        # multi-page download
        if self.page < 2:
            self.page = self.page + 1
            url = self.base_url + str(self.page) + '.html'
            # scrapy.Request is scrapy's GET request
            yield scrapy.Request(url=url, callback=self.parse)
7. CrawlSpider: the link extractor
7.1 Setting up a CrawlSpider project
Create the project: scrapy startproject <project name>
Change into the spiders folder
Create the spider file:
scrapy genspider -t crawl <spider file name> <domain to crawl>
7.2 Exercise: store data from dushu (读书网) in a database and follow links
items.py
# Define here the models for your scraped items
#
# See documentation in:
# .html

import scrapy

class DushuwangItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    name = scrapy.Field()
    src = scrapy.Field()
dushu.py:
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
from dushuwang.items import DushuwangItem

class DushuSpider(CrawlSpider):
    name = "dushu"
    allowed_domains = ["www.dushu"]
    start_urls = [".html"]

    # follow links matching this pattern
    rules = (
        Rule(LinkExtractor(allow=r"/book/1087_\d+\.html"), callback="parse_item", follow=True),
    )

    def parse_item(self, response):
        img_list = response.xpath('//div[@class="bookslist"]/ul//img')
        for img in img_list:
            name = img.xpath('./@alt').extract()
            src = img.xpath('./@data-original').extract()
            book = DushuwangItem(name=name, src=src)
            yield book
settings.py:
DB_HOST ='localhost'
DB_POST = 3306
DB_USER = 'root'
DB_PASSWORD = '****'
DB_NAME = 'python'
DB_CHARSET = 'utf8'

# Configure item pipelines
# See .html
ITEM_PIPELINES = {
    "dushuwang.pipelines.DushuwangPipeline": 300,
    "dushuwang.pipelines.MYSQLpiplines": 301,
}
pipelines.py:
# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: .html

# useful for handling different item types with a single interface
from itemadapter import ItemAdapter
from scrapy.utils.project import get_project_settings
import pymysql

class DushuwangPipeline:
    def open_spider(self, spider):
        self.fp = open('D:/study/Python/爬虫/scrapy框架/dushuwang/dushuwang/spiders/book.json', 'w', encoding='utf-8')

    def process_item(self, item, spider):
        self.fp.write(str(item))
        return item

    def close_spider(self, spider):
        self.fp.close()

# write the data into the database
class MYSQLpiplines:
    def open_spider(self, spider):
        print("***************")
        settings = get_project_settings()
        self.host = settings['DB_HOST']
        self.port = settings['DB_POST']
        self.user = settings['DB_USER']
        self.password = settings['DB_PASSWORD']
        self.name = settings['DB_NAME']
        self.charset = settings['DB_CHARSET']
        self.conn = pymysql.connect(host=self.host, port=self.port, user=self.user,
                                    password=self.password, db=self.name, charset=self.charset)
        self.cur = self.conn.cursor()

    def process_item(self, item, spider):
        sql = 'insert into book_info(name,src) values("{}","{}")'.format(item['name'], item['src'])
        # execute the sql
        self.cur.execute(sql)
        # commit
        self.conn.commit()
        return item

    def close_spider(self, spider):
        self.cur.close()
        self.conn.close()
7.3 Log output and log levels
7.3.1 Log levels:
CRITICAL: critical errors
ERROR: ordinary errors
WARNING: warnings
INFO: general information
DEBUG: debugging information
The default log level is DEBUG
Any log message at the configured level or above will be printed
7.3.2 Setting the log level
settings.py:
LOG_LEVEL = 'WARNING'
Usually the log level is not changed; instead the logs are written to a log file
7.3.3 Setting a log file
settings.py:
LOG_FILE = 'logdemo.log'
8. POST requests in scrapy
If a POST request has no parameters, the request is meaningless,
so start_urls is of no use here, and neither is the parse method.
For POST requests, the start_requests method is used instead of parse,
and the url and data parameters are defined inside it.
8.1 Exercise: scrape Baidu Translate
fanyipost.py:
import scrapy
import json

class FanyipostSpider(scrapy.Spider):
    name = "fanyipost"
    allowed_domains = [""]
    # a POST request without parameters is meaningless,
    # so start_urls is of no use
    # and neither is the parse method
    # start_urls = ["/"]
    # def parse(self, response):
    #     pass

    def start_requests(self):
        url = "/"
        data = {
            'kw': 'lover'
        }
        yield scrapy.FormRequest(url=url, formdata=data, callback=self.parse_second)

    def parse_second(self, response):
        content = response.text
        obj = json.loads(content)
        print(obj)
settings.py:
Disable the robots protocol:
# Obey robots.txt rules
# ROBOTSTXT_OBEY = True