Crawling CNKI Keywords with Python requests
- Web scraping
- 2021-06-21
This post uses Python requests to crawl the CNKI keyword dictionary at:
https://chkdx.cnki.net/kns8/#/mainWordDict
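The dictionary page itself is populated from a JSON API on mci.cnki.net, which is what the script below actually queries. Judging from the parsing code, each response carries a data.data list whose entries have title, entitle, and url fields. A sketch of the assumed shape, with placeholder values:

    # Response shape assumed by the parsing code below; the values here
    # are placeholders, not real API output.
    sample_response = {
        "data": {
            "data": [
                {
                    "title": "...",    # keyword (Chinese title)
                    "entitle": "...",  # English title
                    "url": "...",      # link to the dictionary entry
                }
            ]
        }
    }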
The code is as follows:
import os
import json

import pandas as pd
import requests


def write_to_excel(result, path, sheet_name):
    """Write the DataFrame to Excel, merging with any previously saved results."""
    if os.path.exists(path):
        # Merge with the existing sheet and drop duplicate rows.
        original_df = pd.read_excel(path, sheet_name=sheet_name)
        result = pd.concat([original_df, result], axis=0, join='outer')
        result = result.drop_duplicates()
    # ExcelWriter.save() is deprecated; the context manager closes the file.
    with pd.ExcelWriter(path) as writer:
        result.to_excel(writer, sheet_name=sheet_name, index=False)


def crawl(page, start_page, path):
    df = pd.DataFrame()
    count = 0
    total_pages = 2465
    # `page` is the number of pages fetched per batch;
    # group_num is the number of complete batches.
    group_num = total_pages // page
    start_group = start_page // page
    # Page numbers that failed to download.
    failure_get = []
    for group in range(start_group, group_num + 1):
        if group != group_num:
            # A complete batch: fetch all `page` pages.
            max_m = page + 1
        else:
            # The final, incomplete batch: fetch the remaining pages
            # (+1 so the last page is not skipped).
            max_m = total_pages - page * group_num + 1
        for m in range(1, max_m):
            j = group * page + m
            try:
                print(f'Crawling page {j}')
                url = f'https://mci.cnki.net//statistics/query?q=&start={j}&size=20'
                user_agent = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/114.0.0.0 Safari/537.36'
                cookies = 'cangjieConfig_CHKD2=%7B%22status%22%3Atrue%2C%22startTime%22%3A%222021-12-27%22%2C%22endTime%22%3A%222022-06-24%22%2C%22type%22%3A%22mix%22%2C%22poolSize%22%3A%2210%22%2C%22intervalTime%22%3A10000%2C%22persist%22%3Afalse%7D; SID=012051; Ecp_ClientId=d230612174601189092; Ecp_IpLoginFail=230612101.94.98.44'
                headers_list = {'User-Agent': user_agent, 'Cookie': cookies}
                # Disable any system proxy to avoid proxy-related errors.
                proxies = {"http": None, "https": None}
                resp = requests.get(url=url, proxies=proxies, headers=headers_list)
                # Parse the JSON once and pull out the list of entries.
                for item in json.loads(resp.text)['data']['data']:
                    df.loc[count, 'cn_title'] = item['title']
                    df.loc[count, 'en_title'] = item['entitle']
                    df.loc[count, 'url'] = item['url']
                    count += 1
            except Exception:
                print(f'Failed to crawl page {j}!')
                failure_get.append(j)
        # Save after every batch so a crash does not lose progress.
        write_to_excel(df, path, 'result')
    print(failure_get)


if __name__ == '__main__':
    path = r'test.xlsx'
    # Pages per batch.
    page = 10
    # First page to crawl.
    start_page = 1
    crawl(page, start_page, path)
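The failed page numbers end up in failure_get but are only printed. One way to recover them is to feed that list back through the same request-and-parse logic. The retry_failed helper below is a minimal sketch of that idea, not part of the original script; it reuses write_to_excel from above and assumes the same headers_list and proxies dictionaries:

    import json

    import pandas as pd
    import requests

    def retry_failed(failed_pages, path, headers, proxies):
        # Hypothetical helper: re-fetch the pages recorded in failure_get,
        # parse them exactly as crawl() does, and merge into the same workbook.
        df = pd.DataFrame()
        count = 0
        still_failing = []
        for j in failed_pages:
            try:
                url = f'https://mci.cnki.net//statistics/query?q=&start={j}&size=20'
                resp = requests.get(url, headers=headers, proxies=proxies, timeout=30)
                for item in json.loads(resp.text)['data']['data']:
                    df.loc[count, 'cn_title'] = item['title']
                    df.loc[count, 'en_title'] = item['entitle']
                    df.loc[count, 'url'] = item['url']
                    count += 1
            except Exception:
                # Keep pages that fail again so another pass can be attempted.
                still_failing.append(j)
        write_to_excel(df, path, 'result')  # reuses the helper defined above
        return still_failing

For this to be callable, crawl() would need to return failure_get instead of just printing it; a single retry pass is usually enough for transient network errors.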
If you have other questions, feel free to raise them in the discussion group.