Python Daily Practice (11): Scraping Online Courses
Published: 2019-02-26


???????Python????

1. ?????Excel??

???????????????????????????Python????????????????????????????????????????????????????????Python?????????Python????????????????Excel????

????????????????????????????????????????????????????????????????????Python??????????????????????Python???????????Excel????????????????

2. Fetching the data with requests

??????????????requests?????HTTP?????????????????????????????????

  • Request headers: set headers such as user-agent so that the request looks like it comes from a browser.
  • ??????????????????????????????????????????
  • Parsing JSON: when the response body is JSON, call the json() method to parse it.
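
As a rough sketch of the request-and-parse flow: the URL, the payload fields, and the `message` and `result` keys are taken from the full script later in this post, while the helper names `build_payload`, `extract_courses`, and `fetch_page`, the shortened user-agent string, and the sample response text are made up for illustration.

```python
import json


def build_payload(page_index, keyword="python", page_size=50):
    """Build the JSON body for one page of search results."""
    return {
        "pageSize": page_size,
        "pageIndex": page_index,
        "relativeOffset": 0,
        "searchTimeType": -1,
        "orderType": 5,
        "priceType": -1,
        "activityId": 0,
        "qualityType": 0,
        "keyword": keyword,
    }


def extract_courses(content_json):
    """Return the course list from a parsed response, or [] if it is unusable."""
    if not content_json or content_json.get("message") != "ok":
        return []
    result = content_json.get("result") or {}
    return result.get("list") or []


def fetch_page(page_index):
    """Send the POST request and return the parsed JSON, or None on failure."""
    import requests  # third-party; imported here so the helpers above work without it

    url = "https://study.163.com/p/search/studycourse.json"
    headers = {"user-agent": "Mozilla/5.0", "accept": "application/json"}
    response = requests.post(
        url, json=build_payload(page_index), headers=headers, timeout=10
    )
    if response.status_code != 200:
        return None
    return response.json()


# Hypothetical response text, shaped like the fields the script reads.
sample_text = (
    '{"message": "ok", '
    '"result": {"list": [{"courseId": 1, "productName": "Demo"}]}}'
)
sample_courses = extract_courses(json.loads(sample_text))
print(sample_courses[0]["productName"])
```

Keeping the payload builder and the parser separate from the network call makes them easy to check without touching the site; `fetch_page(1)` would only be called once those pieces behave as expected.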

3. Using the xlsxwriter module

????????????????Excel????????xlsxwriter?????????????????????xlsxwriter??????

  • Install the module: install xlsxwriter with the following command:
pip install xlsxwriter
  • Import the module: import the xlsxwriter module in the code.

  • Create the Excel file: use Workbook to create the file, then add a Worksheet to it.

  • Write the data: use the write() method to write values into the sheet. Note that row and column indexes start at 0.
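
The steps above might look like the following minimal sketch. `Workbook`, `add_worksheet`, `write`, and `close` are xlsxwriter calls; the helper names `rows_to_cells` and `write_xlsx`, the sheet name, and the sample rows are assumptions for illustration only.

```python
def rows_to_cells(header, rows):
    """Turn a header and data rows into (row, col, value) triples, 0-based."""
    cells = [(0, col, name) for col, name in enumerate(header)]
    for r, row in enumerate(rows, start=1):  # row 0 is taken by the header
        for c, value in enumerate(row):
            cells.append((r, c, value))
    return cells


def write_xlsx(path, header, rows):
    """Write the header and rows to a new workbook with xlsxwriter."""
    import xlsxwriter  # third-party: pip install xlsxwriter

    workbook = xlsxwriter.Workbook(path)
    worksheet = workbook.add_worksheet("courses")
    for row, col, value in rows_to_cells(header, rows):
        worksheet.write(row, col, value)
    workbook.close()  # the file is only written to disk on close()


header = ["courseId", "productName", "learnerCount"]
rows = [(1001, "Python basics", 250), (1002, "Web scraping", 90)]  # made-up data
cells = rows_to_cells(header, rows)
```

Calling `write_xlsx("courses.xlsx", header, rows)` would then produce a sheet with the header in row 0 and one course per row beneath it.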

4. ??????????

??????????????????

  • ?????????????????????????????????????????????
  • ??????????????????????????????????????????????????????????????????
  • ??????????????????????????????????????????????

5. ???????????

??????????????????????????????????????????????????????????????????????

6. Saving the data to MySQL

????????Excel???????????????MySQL????????????????

  • ??????MySQL???????????????????????
  • ???????SQL???????????????????????????
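
A hedged sketch of the insert step: the 16 field names and the `course` table name come from the full script in this post, while `COURSE_FIELDS`, `build_insert_sql`, `to_row`, and `save_courses` are hypothetical helper names. The table is assumed to already exist with 16 columns in this order.

```python
# JSON keys read by the script, in the order they are inserted.
COURSE_FIELDS = [
    "courseId", "productId", "productType", "productName",
    "provider", "score", "scoreLevel", "learnerCount",
    "lessonCount", "lectorName", "originalPrice", "discountPrice",
    "discountRate", "imgUrl", "bigImgUrl", "description",
]


def build_insert_sql(table, n_columns):
    """Build a parameterized INSERT with one %s placeholder per column."""
    placeholders = ", ".join(["%s"] * n_columns)
    return f"insert into {table} values ({placeholders})"


def to_row(item):
    """Convert one course dict into a tuple in COURSE_FIELDS order."""
    return tuple(item[field] for field in COURSE_FIELDS)


def save_courses(items, **connect_kwargs):
    """Insert all items in one batch; connect_kwargs go to pymysql.connect()."""
    import pymysql  # third-party: pip install pymysql

    conn = pymysql.connect(**connect_kwargs)
    try:
        with conn.cursor() as cursor:
            sql = build_insert_sql("course", len(COURSE_FIELDS))
            cursor.executemany(sql, [to_row(item) for item in items])
        conn.commit()  # pymysql does not autocommit by default
    finally:
        conn.close()


insert_sql = build_insert_sql("course", len(COURSE_FIELDS))
print(insert_sql)
```

Passing values through `%s` placeholders lets the driver handle quoting, which avoids building SQL strings by hand from scraped text.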

7. Complete code

??????????????????requests?pymysql???????????Python???????????MySQL?????

import time
from multiprocessing import Pool

import requests
from pymysql import connect

# Scrape Python course listings from study.163.com and save them to MySQL.

conn = None  # per-process MySQL connection, set in init_worker()
cs1 = None   # cursor on that connection


def init_worker():
    """Open one MySQL connection per worker; a connection must not be shared across processes."""
    global conn, cs1
    conn = connect(host="localhost", port=3306, database="wyy_spider",
                   user="root", password="mysql", charset="utf8")
    cs1 = conn.cursor()


def get_json(index):
    """Request one page of search results; return the parsed JSON, or None on failure."""
    url = "https://study.163.com/p/search/studycourse.json"
    payload = {
        "pageSize": 50,
        "pageIndex": index,
        "relativeOffset": 0,
        "searchTimeType": -1,
        "orderType": 5,
        "priceType": -1,
        "activityId": 0,
        "qualityType": 0,
        "keyword": "python"
    }
    headers = {
        "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.100 Safari/537.36",
        "accept": "application/json",
        "content-type": "application/json",
        "origin": "https://study.163.com"
    }
    response = requests.post(url, json=payload, headers=headers, timeout=10)
    if response.status_code == 200:
        content_json = response.json()
        if content_json and content_json["message"] == "ok":
            return content_json
    return None


def get_content(content_json):
    """Return the course list from a response, or [] if the request failed."""
    if content_json and "result" in content_json:
        return content_json["result"]["list"]
    return []


def check_course_exit(course_id):
    """Return True if a course with this ID is already in the course table."""
    # Parameterized query instead of string formatting
    sql = "select course_id from course where course_id = %s"
    cs1.execute(sql, (course_id,))
    return cs1.fetchone() is not None


def save_to_course(course_data):
    """Insert a batch of course rows into the course table."""
    sql_course = """insert into course
                   values (%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s)
                   """
    cs1.executemany(sql_course, course_data)


def save_mysql(content):
    """Collect the courses that are not stored yet and insert them."""
    course_data = []
    for item in content:
        if not check_course_exit(item['courseId']):
            course_value = (
                item['courseId'],
                item['productId'],
                item['productType'],
                item['productName'],
                item['provider'],
                item['score'],
                item['scoreLevel'],
                item['learnerCount'],
                item['lessonCount'],
                item['lectorName'],
                item['originalPrice'],
                item['discountPrice'],
                item['discountRate'],
                item['imgUrl'],
                item['bigImgUrl'],
                item['description']
            )
            course_data.append(course_value)
    if course_data:
        save_to_course(course_data)


def main(index):
    """Fetch one page, extract its courses, and save them."""
    content_json = get_json(index)
    content = get_content(content_json)
    save_mysql(content)
    conn.commit()  # commit on this worker's own connection


if __name__ == '__main__':
    print("******************* start *******************")
    start = time.time()
    first_page = get_json(1)
    if first_page is None:
        raise SystemExit("Could not fetch the first page")
    # "totlePageCount" is the key name used in the original script
    total_page_count = first_page['result']["query"]["totlePageCount"]
    # Page indexes are assumed to be 1-based, matching the get_json(1) call above
    index_list = list(range(1, total_page_count + 1))
    pool = Pool(initializer=init_worker)
    pool.map(main, index_list)
    pool.close()
    pool.join()
    print("Scraping finished")
    end = time.time()
    print(f"Elapsed time: {end - start} s")
    print("******************* end *******************")

8. Code walkthrough

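  • get_json: sends the HTTP POST request for one page of search results and returns the parsed JSON when the status code is 200 and the message field is "ok"; otherwise it returns None.
  • get_content: extracts the course list from the JSON response.
  • check_course_exit: queries the course table to check whether a course with the given ID already exists.
  • save_to_course: inserts a batch of course rows into the course table.
  • save_mysql: builds a row for each course that is not stored yet and passes the batch to save_to_course.
  • main: handles one page: fetches the JSON, extracts the course list, and saves it to MySQL.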

9. ????

???????????????????????MySQL????????????????course????????????

10. Summary

???????????????????????????????Python????????Excel?MySQL???????????????????????????????????????????????????????????????????

Reposted from: http://blvk.baihongyu.com/
