
Python Web Scraping for Beginners: How to Scrape a Job Site and Analyze the Data


Updated: 2025-05-02 23:30:03


Posted on March 7, 2024 (author: java入门推荐)

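0 Preface

In my spare time I often wonder what interesting things I could build. In the Internet era, "Internet thinking" is everywhere, and buzzwords such as big data, deep learning, and artificial intelligence have stirred up a whirlwind. So I decided to follow the trend too, and bought a book on Python web scraping to study.

A web crawler, as the name suggests, fetches content from the Internet according to rules you define and saves it. In theory, any page content you can see in a browser can be fetched and saved. Many sites do have anti-scraping measures, but those only raise the cost of scraping, so it makes sense to scrape content that is valuable enough to justify the technical effort.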

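1 Choosing a case

People spend a third of their time at work; with a job you enjoy, that third of your life is happy. So I chose a job site as my first learning case.

A while ago I was chatting with an old classmate and found out that he works in interaction design (a role I knew nothing about). So I decided to scrape the interaction design job postings on 前程无忧 (site title: 招聘网_人才网_找工作_求职_上前程无忧).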

2 Implementation

I use the Scrapy framework for the crawl.

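2.1 Project structure

C:\Users\hyperstrong\spiderjob_jiaohusheji
│
└─spiderjob
    │  __init__.py
    │
    ├─spiders
    │      __init__.py

Here:

items.py defines the items extracted from the web page.

The spider module in the spiders directory is the main program.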

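2.2 Constructing the links

Open the 前程无忧 site in a browser, enter "交互设计师" (interaction designer) in the job search box, and look at the URL of the results page:

【交互设计师招聘】前程无忧手机网_触屏版

/jobsearch/search_?fromJs=1&keyword=%E4%BA%A4%E4%BA%92%E8%AE%BE%E8%AE%A1%E5%B8%88&keywordtype=2&lang=c&stype=2&postchannel=0000&fromType=1&confirmdate=9

This URL contains no page number, so go to the second page and look at its URL. That one does contain the page number, so changing the number lets the crawler step automatically from page 1 to page 44. Alternatively, you can extract the "next page" link from the page content and follow it; interested readers can give that a try.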
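As a rough sketch of the page-number substitution described above: the list URL is split into the part before the page number and the part after it, and each page address is the two parts joined around the number. The host `https://example.com` and the shortened prefix and suffix are placeholders chosen for illustration, not the site's real address, and `build_page_urls` is a hypothetical helper name.

```python
def build_page_urls(prefix, suffix, first_page, last_page):
    """Return one URL per page by placing the page number between prefix and suffix."""
    return [prefix + str(page) + suffix
            for page in range(first_page, last_page + 1)]


# Hypothetical stand-ins for the real URL pieces.
PREFIX = "https://example.com/list/000000,000000,0000,00,9,99,keyword,2,"
SUFFIX = ".html?lang=c&stype=1"

urls = build_page_urls(PREFIX, SUFFIX, 1, 44)
print(len(urls))   # 44
print(urls[0])
print(urls[-1])
```

Note that `range` excludes its upper bound, which is why the helper adds 1 to `last_page`.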

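2.3 Page analysis

The fields I want to scrape are:

Job title

Company name

Work location

Salary

Posting date

I located them by comparing the rendered page with its source code in the browser's developer tools (F12).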

2.4 Data fields

# -*- coding: utf-8 -*-

# Define here the models for your scraped items
# See documentation in:
# /en/latest/topics/

import scrapy


class SpiderjobItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    jobname = scrapy.Field()
    companyname = scrapy.Field()
    workingplace = scrapy.Field()
    salary = scrapy.Field()
    posttime = scrapy.Field()

2.5 Main program

I wrote this in Python 2.7 and use XPath expressions to select and extract the data.

# -*- coding: utf-8 -*-
from scrapy import Request
from scrapy.spiders import Spider
from spiderjob.items import SpiderjobItem


class jobSpider(Spider):
    name = 'jobSpider'

    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) '
                      'AppleWebKit/537.36 (KHTML, like Gecko) Chrome/49.0.2623.75 '
                      'Safari/537.36 LBBROWSER',
        'Accept': 'text/css,*/*;q=0.1',
        'Accept-Encoding': 'gzip, deflate, sdch',
        'Accept-Language': 'zh-CN,zh;q=0.8',
        'Referer': 'close',
        'Host': ''}

    # A list-page URL is url1 + page number + url2 (site host omitted).
    url1 = '/list/000000,000000,0000,00,9,99,%25E4%25BA%25A4%25E4%25BA%2592%25E8%25AE%25BE%25E8%25AE%25A1%25E5%25B8%2588,2,'
    url2 = '.html?lang=c&stype=1&postchannel=0000&workyear=99&cotype=99&degreefrom=99&jobterm=99&companysize=99&lonlat=0%2C0&radius=-1&ord_field=0&confirmdate=9&fromType=1&dibiaoid=0&address=&line=&specialarea=00&from=&welfare='

    def start_requests(self):
        # Start from page 1.
        url = self.url1 + '1' + self.url2
        yield Request(url, headers=self.headers)

    def parse(self, response):
        jobs = response.xpath('//div[@class="dw_table"]/div[@class="el"]')
        for job in jobs:
            # Create a fresh item per row so yielded items do not share state.
            item = SpiderjobItem()
            item['companyname'] = job.xpath(
                './/span[@class="t2"]/a[@target="_blank"]/text()').extract()[0]
            item['workingplace'] = job.xpath(
                './/span[@class="t3"]/text()').extract()[0]
            # No [0] here: the list is empty when a posting shows no salary.
            item['salary'] = job.xpath(
                './/span[@class="t4"]/text()').extract()
            item['posttime'] = job.xpath(
                './/span[@class="t5"]/text()').extract()[0]
            item['jobname'] = job.xpath(
                './/p[@class="t1 "]/span/a[@target="_blank"]/text()').extract()[0]
            yield item

        # Queue pages 2 to 44; Scrapy's duplicate filter drops repeated requests.
        for i in range(2, 45):
            next_url = self.url1 + str(i) + self.url2
            yield Request(next_url, headers=self.headers, callback=self.parse)
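The row-by-row extraction pattern in parse can be tried offline without Scrapy. The sketch below uses the standard library's ElementTree, which supports only a small subset of XPath and needs well-formed markup, so it is a stand-in for Scrapy selectors rather than the same API. The sample markup and its values are made up; only the class names t1 to t5 and the field names come from the spider above (the sample uses a plain "t1" class without the trailing space).

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified markup that mirrors the class names used in the spider.
SAMPLE = """
<div class="dw_table">
  <div class="el">
    <p class="t1"><span><a target="_blank">Interaction Designer</a></span></p>
    <span class="t2"><a target="_blank">Example Co.</a></span>
    <span class="t3">Shanghai</span>
    <span class="t4">1-1.5万/月</span>
    <span class="t5">03-07</span>
  </div>
  <div class="el">
    <p class="t1"><span><a target="_blank">UX Designer</a></span></p>
    <span class="t2"><a target="_blank">Sample Ltd.</a></span>
    <span class="t3">Hangzhou</span>
    <span class="t4"></span>
    <span class="t5">03-06</span>
  </div>
</div>
"""


def extract_jobs(markup):
    """Return one dict per job row, using paths relative to each row element."""
    root = ET.fromstring(markup.strip())
    jobs = []
    for row in root.findall("./div[@class='el']"):
        jobs.append({
            "jobname": row.findtext("./p[@class='t1']/span/a"),
            "companyname": row.findtext("./span[@class='t2']/a"),
            "workingplace": row.findtext("./span[@class='t3']"),
            # findtext returns "" for an element that exists but has no text.
            "salary": row.findtext("./span[@class='t4']"),
            "posttime": row.findtext("./span[@class='t5']"),
        })
    return jobs


jobs = extract_jobs(SAMPLE)
for job in jobs:
    print(job["jobname"], "|", job["companyname"], "|", job["posttime"])
```

The key point is that every field path is evaluated relative to its own row, so values from different postings never get mixed up.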

2.6 Scraping results

Open Start > Run and type cmd, change to the directory C:\Users\hyperstrong\spiderjob_jiaohusheji, then run scrapy crawl jobSpider -o followed by the name of the output file.
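Scrapy's CSV feed export writes a header row made of the item field names, so the exported file can be inspected with the standard csv module before any analysis. The sketch below reads an in-memory sample instead of a real export; the rows and the column order are invented for illustration, and `load_rows` is a hypothetical helper name.

```python
import csv
import io

# Made-up sample in the shape of the exported file (column order is an assumption).
SAMPLE_CSV = (
    "companyname,jobname,posttime,salary,workingplace\n"
    "Example Co.,Interaction Designer,03-07,1-1.5万/月,上海-浦东新区\n"
    "Sample Ltd.,UX Designer,03-06,,杭州\n"
)


def load_rows(text):
    """Parse CSV text into a list of dicts keyed by the header row."""
    return list(csv.DictReader(io.StringIO(text)))


rows = load_rows(SAMPLE_CSV)
for row in rows:
    has_salary = "yes" if row["salary"] else "no"
    print(row["jobname"], "| salary listed:", has_salary)
```

Checking for empty salary cells at this stage is useful, because postings without a salary have to be skipped when averages are computed.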

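3 A simple analysis of the data

Two features are taken from the spreadsheet: salary and city.

Analyze the average interaction design salary in each city.

Analyze the demand for interaction design roles in each city, that is, how easy it is to find such a job there.

No sooner said than done. Here is the code: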

#!/usr/bin/python
# -*- coding: UTF-8 -*-
import pandas as pd
import numpy as np
from datetime import datetime
import matplotlib.pyplot as plt
import sys
import re
import csv
import string


def analyze_job_demand(filepath):
    data = pd.read_csv(filepath)
    wp = []
    num = len(data['workingplace'])
    # Keep the first two characters of each location, i.e. the city name.
    for i in range(num):
        a = data['workingplace'].iloc[i].decode('utf-8')
        b = a[0:2].encode('utf-8')
        wp.append(b)
    bj = wp.count('北京')
    sh = wp.count('上海')
    gz = wp.count('广州')
    sz = wp.count('深圳')
    wh = wp.count('武汉')
    cd = wp.count('成都')
    cq = wp.count('重庆')
    zz = wp.count('郑州')
    nj = wp.count('南京')
    sz1 = wp.count('苏州')
    hz = wp.count('杭州')
    xa = wp.count('西安')
    dl = wp.count('大连')
    qd = wp.count('青岛')
    cs = wp.count('长沙')
    nc = wp.count('南昌')
    hf = wp.count('合肥')
    nb = wp.count('宁波')
    km = wp.count('昆明')
    last = num-bj-sh-gz-sz-wh-cd-cq-nj-sz1-hz-xa-cs-hf
    # Print each city's share of nationwide demand.
    shares = [('Wuhan', wh), ('Suzhou', sz1), ('Hangzhou', hz), ('Hefei', hf),
              ('Changsha', cs), ('Beijing', bj), ('Shanghai', sh),
              ('Guangzhou', gz), ('Shenzhen', sz), ('Chongqing', cq),
              ('Chengdu', cd), ('Nanjing', nj), ("Xi'an", xa)]
    for name, count in shares:
        print(name + ' share of nationwide interaction design postings: ' +
              str(float(count)/num*100) + '%')
    # Draw the pie chart.
    # Figure size: width, height.
    plt.figure(figsize=(6,9))
    # Slice labels, given as a list.
    labels = ['shanghai','shenzhen','beijing','guangzhou','hangzhou','wuhan',
              'chengdu','chongqing','nanjing','suzhou','xian','changsha','hefei','else']
    sizes = [sh,sz,bj,gz,hz,wh,cd,cq,nj,sz1,xa,cs,hf,last]
    colors = ['red','yellowgreen','lightskyblue','blue','pink','coral','orange']
    # Pull a slice out of the pie: each value is that slice's offset from the
    # center, so only the first slice is separated here.
    explode = (0.05,0,0,0,0,0,0,0,0,0,0,0,0,0)
    patches, l_text, p_text = plt.pie(sizes, explode=explode, labels=labels,
                                      colors=colors, labeldistance=1.1,
                                      autopct='%3.1f%%', shadow=False,
                                      startangle=90, pctdistance=0.6)
    # labeldistance: how far the labels sit from the center; 1.1 means 1.1 times the radius
    # autopct: format of the text inside the pie; %3.1f%% is a float with minimum width 3 and one decimal place
    # shadow: whether the pie has a shadow
    # startangle: starting angle; 0 starts at the x-axis and goes counterclockwise; 90 usually looks better
    # pctdistance: distance of the percentage text from the center
    # patches, l_text, p_text: return values of pie(); p_text is the text inside the pie, l_text the labels outside
    # Change the text size by calling set_size on every text object.
    for t in l_text:
        t.set_size(30)
    for t in p_text:
        t.set_size(40)
    # Use equal x and y scales so the pie is drawn as a circle.
    plt.axis('equal')
    # Text inside the figure cannot be set through rcParams.
    plt.show()

def analyze_salary(filepath):
    data = pd.read_csv(filepath)
    chengshi = [u'北京',u'上海',u'广州',u'深圳',u'武汉',u'成都',u'重庆',u'郑州',u'南京',u'苏州',u'杭州',u'西安',u'大连',u'青岛',u'长沙',u'南昌',u'合肥',u'宁波',u'昆明']
    city_salary = []
    num = len(data['workingplace'])
    # Matches integers and decimals such as 8 or 1.5.
    pattern = re.compile(r'\d+\.?\d*', re.S)
    # Factors that convert each unit to thousand yuan per month.
    units = [(u'万/月', 10), (u'千/月', 1), (u'万/年', 0.8333)]
    for city in chengshi:
        salary = []
        for i in range(num):
            a = data['workingplace'].iloc[i].decode('utf-8')
            if a.find(city) != -1 and data['salary'].iloc[i]:
                c = str(data['salary'].iloc[i])
                d = c.decode('utf-8')
                for unit, factor in units:
                    if d.find(unit) != -1:
                        items = re.findall(pattern, c)
                        # Midpoint of the salary range, e.g. 1-1.5.
                        ave = (float(items[0]) + float(items[1])) / 2
                        salary.append(ave * factor)
                        break
        # Cities with no usable salary data get 0.0 instead of a division error.
        ave = sum(salary) / len(salary) if salary else 0.0
        print(city + u' average interaction design salary: ' +
              str(ave) + u' thousand yuan/month')
        city_salary.append(ave)
    with open(r'C:\Users\hyperstrong\spiderjob_', 'wb') as f:
        writer = csv.writer(f)
        chengshi_encode = [city.encode('utf-8') for city in chengshi]
        writer.writerow(chengshi_encode)
        writer.writerow(city_salary)


if __name__ == '__main__':
    # Python 2 input() evaluates what is typed, so enter the path as a quoted string.
    filepath = input('Please enter the filename:')
    analyze_salary(filepath)
    analyze_job_demand(filepath)
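The unit conversion inside analyze_salary can be isolated into a small function that is easy to check by hand. This is a sketch written for Python 3, not the author's code: `salary_to_k_per_month` is a hypothetical name, the three units are the ones handled above, and the yearly factor is written as 10/12 (the 0.8333 used above is its rounded value). Unlike the listing above, it also accepts a single number instead of a range.

```python
import re

# Factors that convert each unit to thousand yuan per month.
UNIT_FACTORS = {
    "万/月": 10.0,       # ten-thousand yuan per month
    "千/月": 1.0,        # thousand yuan per month
    "万/年": 10.0 / 12,  # ten-thousand yuan per year
}

NUMBER = re.compile(r"\d+\.?\d*")


def salary_to_k_per_month(text):
    """Return the midpoint of a salary string in thousand yuan per month.

    Returns None when the unit is unknown or no number is present.
    """
    for unit, factor in UNIT_FACTORS.items():
        if unit in text:
            numbers = [float(n) for n in NUMBER.findall(text)]
            if not numbers:
                return None
            return sum(numbers) / len(numbers) * factor
    return None


print(salary_to_k_per_month("1-1.5万/月"))  # 12.5
print(salary_to_k_per_month("6-8千/月"))    # 7.0
print(salary_to_k_per_month("面议"))        # None
```

Returning None for unparseable strings lets the caller skip those rows explicitly instead of reusing a value from an earlier posting.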

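Run the program and enter the path to the spreadsheet, r'C:\Users\hyperstrong\spiderjob_' (note the r prefix in front of the path, which turns off escape characters).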
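A short illustration of why the r prefix matters for Windows paths: without it, sequences such as \n and \t inside a string literal are read as a newline and a tab. The path below is a made-up example chosen to contain both sequences.

```python
plain = 'C:\new\table.csv'   # \n and \t are read as newline and tab
raw = r'C:\new\table.csv'    # backslashes are kept literally

print(len(plain), len(raw))        # 14 16
print('\n' in plain, '\n' in raw)  # True False
```

In Python 3 the prefix is even more important for paths under C:\Users, because \U in a normal string literal starts a Unicode escape and the literal fails to compile.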

The results are as follows.

The automatically generated pie chart shows each city's share of nationwide demand for the position.
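The shares behind such a pie chart can also be computed with collections.Counter instead of one count variable per city. This is a sketch for Python 3 under stated assumptions: the sample locations are invented, `city_shares` and `LABELS` are hypothetical names, and, as in the analysis code, the first two characters of a location are taken to be the city.

```python
from collections import Counter

# Chinese city name -> chart label (a small illustrative subset).
LABELS = {"北京": "beijing", "上海": "shanghai", "杭州": "hangzhou"}


def city_shares(workingplaces, labels):
    """Return {label: percent of postings}, with the remainder under 'else'."""
    total = len(workingplaces)
    if total == 0:
        return {}
    # The first two characters of each location are treated as the city name.
    counts = Counter(place[:2] for place in workingplaces)
    shares = {label: 100.0 * counts[city] / total
              for city, label in labels.items()}
    shares["else"] = 100.0 - sum(shares.values())
    return shares


# Made-up sample locations.
sample = ["上海-浦东新区", "上海", "北京", "杭州", "东莞"]
shares = city_shares(sample, LABELS)
for label, pct in shares.items():
    print(label, round(pct, 1))
```

Computing "else" as the remainder keeps the slices summing to 100 percent even when a posting's city is not in the label table.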

The averages saved to the spreadsheet were then plotted as a bar chart of the average salary for the position in each city.

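The charts show that:

1. Beijing has the highest average interaction design salary.

2. Hangzhou's demand and average salary for interaction design are already at the level of the first-tier cities; no wonder so many IT professionals are moving to Hangzhou.

3. Jiangsu, Zhejiang, and Shanghai together account for about half of the demand.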

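Closing remarks

After all this work, I have a first impression of the role in terms of demand by city and average salary. If I were a university student about to graduate and hoping to work in this field, these results would probably be of some help. The real choice is harder, though: competition differs from city to city, and so do mortgage pressure, happiness index, industry mix, and future growth potential. That gave me an idea: how could job-seeking data across industries in different cities reveal a city's happiness index and growth potential? I'll look into it when I find the time ~~~///(^v^)~~~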

