【Python】【Web Scraping】Scraping a 5,000-Chapter Novel: Problems Encountered and How I Solved Them

Problem Analysis

Recap

I previously wrote a scraper for a novel site. It works as follows:

First it scrapes the novel's index page to collect every chapter's info (chapter title and reading link), then it uses a worker pool (pool = Pool(50)) to fetch each chapter's body through its link and save it as a local Markdown file. (Code at the end of this post: run01.py.)
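A minimal sketch of that fan-out pattern, with a hypothetical fetch_chapter standing in for the real download-and-save logic in run01.py (note that run01.py itself imports multiprocessing.Pool, a process pool, even though the post calls it multithreading):

from multiprocessing.dummy import Pool  # thread-backed Pool with the same API as multiprocessing.Pool

def fetch_chapter(link):
    # Hypothetical stand-in for run01.py's getContent() + docToMd():
    # download one chapter and write it out.
    print("would fetch", link)

links = ["https://www.lingdianksw8.com/31/31596/8403973.html"]  # chapter links from the index page
pool = Pool(50)                 # 50 concurrent workers
pool.map(fetch_chapter, links)  # blocks until every link has been processed
pool.close()
pool.join()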

Scraping 100 chapters took 10 seconds.

Capped at 101 chapters, the program took 9 seconds from start to finish.

Redis + MongoDB, no multithreading

Having recently learned Redis and MongoDB, I changed the flow: after scraping the index, the chapter links are pushed into Redis, and the scraper then pops links from Redis to fetch each chapter. (Code at the end of this post: run02.py.)
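The core of that handoff is just a Redis list used as a queue. A minimal sketch (the chapter path here is made up, and a Redis server on localhost with default settings is assumed):

import redis

client = redis.StrictRedis()                          # assumes a local Redis on the default port
client.lpush('url_queue', '/61153/61153348/1.html')   # producer side: queue one chapter path

while client.llen('url_queue') > 0:                   # consumer side: drain the queue
    url = client.lpop('url_queue').decode()           # redis-py returns bytes
    print("would fetch https://www.lingdianksw8.com" + url)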

…No need to even benchmark it: reading one chapter at a time is painfully slow!

Scraping 101 chapters took two minutes!

Redis + MongoDB + multithreading

Scraping 101 chapters took only 8 seconds!

Scraping 4,012 chapters took 1 minute 10 seconds!

Problems and Analysis

Rather than typing it all up, I recorded a video and posted it on Bilibili (search for 萌狼蓝天 there).

Everything else is on my Bilibili profile.

Code (2022-10-20)

run01.py

# -*- coding: UTF-8 -*-
# Author: 萌狼蓝天
# Blog: https://mllt.cc
# Notes: https://cnblogs.com/mllt
# Bilibili / WeChat Official Account: 萌狼蓝天
# Written: 2022/9/28
# https://www.lingdianksw8.com/31/31596/
import datetime
import os
import random
import re
from multiprocessing import Pool

import bs4
import requests

os.environ['NO_PROXY'] = "www.lingdianksw8.com"


# Append a line to the log file
def Log_text(lx="info", *text):
    lx = lx.upper()
    with open("log.log", "a+", encoding="utf-8") as f:
        f.write("[" + str(datetime.datetime.now()) + "]" + "[" + lx + "]")
        for i in text:
            f.write(i)


# Debug output to the console
def log(message, i="info"):
    if isinstance(message, str):
        i = i.upper()
        print("[", i, "] [", str(type(message)), "]", message)
    elif isinstance(message, list):
        count = 0
        for j in message:
            print("[", i, "] [", str(count), "] [", str(type(message)), "]", j)
            count += 1
    else:
        print("[", i, "] [", str(type(message)), "]", end=" ")
        print(message)


# Fetch page source
def getCode(url, methods="post"):
    """
    Fetch a page's source.
    :param methods: HTTP method to use
    :param url: page URL (book index page or chapter page)
    :return: page source
    """
    # Build request headers with a random User-Agent
    user_agent = [
  46. "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; AcooBrowser; .NET CLR 1.1.4322; .NET CLR 2.0.50727)",
  47. "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.0; Acoo Browser; SLCC1; .NET CLR 2.0.50727; Media Center PC 5.0; .NET CLR 3.0.04506)",
  48. "Mozilla/4.0 (compatible; MSIE 7.0; AOL 9.5; AOLBuild 4337.35; Windows NT 5.1; .NET CLR 1.1.4322; .NET CLR 2.0.50727)",
  49. "Mozilla/5.0 (Windows; U; MSIE 9.0; Windows NT 9.0; en-US)",
  50. "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Win64; x64; Trident/5.0; .NET CLR 3.5.30729; .NET CLR 3.0.30729; .NET CLR 2.0.50727; Media Center PC 6.0)",
  51. "Mozilla/5.0 (compatible; MSIE 8.0; Windows NT 6.0; Trident/4.0; WOW64; Trident/4.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; .NET CLR 1.0.3705; .NET CLR 1.1.4322)",
  52. "Mozilla/4.0 (compatible; MSIE 7.0b; Windows NT 5.2; .NET CLR 1.1.4322; .NET CLR 2.0.50727; InfoPath.2; .NET CLR 3.0.04506.30)",
  53. "Mozilla/5.0 (Windows; U; Windows NT 5.1; zh-CN) AppleWebKit/523.15 (KHTML, like Gecko, Safari/419.3) Arora/0.3 (Change: 287 c9dfb30)",
  54. "Mozilla/5.0 (X11; U; Linux; en-US) AppleWebKit/527+ (KHTML, like Gecko, Safari/419.3) Arora/0.6",
  55. "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.2pre) Gecko/20070215 K-Ninja/2.1.1",
  56. "Mozilla/5.0 (Windows; U; Windows NT 5.1; zh-CN; rv:1.9) Gecko/20080705 Firefox/3.0 Kapiko/3.0",
  57. "Mozilla/5.0 (X11; Linux i686; U;) Gecko/20070322 Kazehakase/0.4.5",
  58. "Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.0.8) Gecko Fedora/1.9.0.8-1.fc10 Kazehakase/0.5.6",
  59. "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.11 (KHTML, like Gecko) Chrome/17.0.963.56 Safari/535.11",
  60. "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_3) AppleWebKit/535.20 (KHTML, like Gecko) Chrome/19.0.1036.7 Safari/535.20",
  61. "Opera/9.80 (Macintosh; Intel Mac OS X 10.6.8; U; fr) Presto/2.9.168 Version/11.52",
  62. "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.11 (KHTML, like Gecko) Chrome/20.0.1132.11 TaoBrowser/2.0 Safari/536.11",
  63. "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/21.0.1180.71 Safari/537.1 LBBROWSER",
  64. "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; WOW64; Trident/5.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; Media Center PC 6.0; .NET4.0C; .NET4.0E; LBBROWSER)",
  65. "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; QQDownload 732; .NET4.0C; .NET4.0E; LBBROWSER)",
  66. "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.11 (KHTML, like Gecko) Chrome/17.0.963.84 Safari/535.11 LBBROWSER",
  67. "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.1; WOW64; Trident/5.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; Media Center PC 6.0; .NET4.0C; .NET4.0E)",
  68. "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; WOW64; Trident/5.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; Media Center PC 6.0; .NET4.0C; .NET4.0E; QQBrowser/7.0.3698.400)",
  69. "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; QQDownload 732; .NET4.0C; .NET4.0E)",
  70. "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; Trident/4.0; SV1; QQDownload 732; .NET4.0C; .NET4.0E; 360SE)",
  71. "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; QQDownload 732; .NET4.0C; .NET4.0E)",
  72. "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.1; WOW64; Trident/5.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; Media Center PC 6.0; .NET4.0C; .NET4.0E)",
  73. "Mozilla/5.0 (Windows NT 5.1) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/21.0.1180.89 Safari/537.1",
  74. "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/21.0.1180.89 Safari/537.1",
  75. "Mozilla/5.0 (iPad; U; CPU OS 4_2_1 like Mac OS X; zh-cn) AppleWebKit/533.17.9 (KHTML, like Gecko) Version/5.0.2 Mobile/8C148 Safari/6533.18.5",
  76. "Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:2.0b13pre) Gecko/20110307 Firefox/4.0b13pre",
  77. "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:16.0) Gecko/20100101 Firefox/16.0",
  78. "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.64 Safari/537.11",
  79. "Mozilla/5.0 (X11; U; Linux x86_64; zh-CN; rv:1.9.2.10) Gecko/20100922 Ubuntu/10.10 (maverick) Firefox/3.6.10"
  80. ]
    headers = {
        'User-Agent': random.choice(user_agent),
        # "user-agent": user_agent[random.randint(0, len(user_agent) - 1)]
    }
    # Fetch the page
    result = requests.request(methods, url, headers=headers, allow_redirects=True)
    log("cookie" + str(result.cookies.values()))
    tag = 0
    log("Initial page encoding: " + result.encoding)
    if result.encoding != "gbk":
        log("Initial encoding is not gbk; re-encoding is required", "warn")
        tag = 1
    try:
        result = requests.request(methods, url, headers=headers, allow_redirects=True, cookies=result.cookies)
    except:
        return "InternetError"
    result_text = result.text
    # print(result_text)
    if tag == 1:
        result_text = recoding(result)
        log("Re-encoding done; text decoded as gbk")
    return result_text


def recoding(result):
    # Decode the raw bytes as GBK; fall back to GB18030 (a superset of GBK),
    # then to requests' own guess.
    try:
        result_text = result.content.decode("gbk")
    except UnicodeDecodeError:
        # e.g. UnicodeDecodeError: 'gbk' codec can't decode byte 0xae in position 6917
        try:
            result_text = result.content.decode("gb18030", errors='ignore')
        except Exception:
            result_text = result.text
    return result_text


# Parse the index page
def getDict(code):
    """
    Parse the page source and return the chapter list as a list of dicts.
    :param code: page source
    :return: list of {"title": ..., "link": ...}
    """
    # Narrow the search down with a regex
    # (the pattern matches the site's Chinese chapter-list markup, so it stays in Chinese)
    code = re.findall("正文卷</dt>(.*?)</dl>", code, re.S)[0]
    # log(code)
    # obj = bs4.BeautifulSoup(markup=code, features="html.parser")
    obj = bs4.BeautifulSoup(markup=code, features="lxml")
    # find_all returns a list of <a> tags
    tag = obj.find_all("a")
    log("Number of chapter links: " + str(len(tag)))
    result = []
    for i in range(len(tag)):
        link = tag[i]["href"]
        text = tag[i].get_text()
        result.append({"title": text, "link": "https://www.lingdianksw8.com" + link})
    return result


# Fetch a chapter body
def getContent(url):
    code = getCode(url, "get")
    if code == "InternetError":
        return "InternetError", ""
    try:
        code = code.replace("<br />", "\n")
        code = code.replace("&nbsp;", " ")
        code = code.replace("\u3000", " ")  # full-width spaces used for paragraph indentation
    except Exception as e:
        # AttributeError: 'tuple' object has no attribute 'replace'
        Log_text("error", "[run01-161~163]" + str(e))
    # with open("temp.txt", "w+", encoding="utf-8") as f:
    #     f.write(code)
    obj = bs4.BeautifulSoup(markup=code, features="lxml")
    title = obj.find_all("h1")[0].text
    try:
        content = obj.find_all("div", attrs={"class": "showtxt"})[0].text
    except:
        return None, None
    # with open("temp.txt", "w+", encoding="utf-8") as f:
    #     f.write(content)
    # log(content)
    try:
        # The pattern matches the site's embedded Chinese ad blurb, so it stays in Chinese
        g = re.findall(
            "(:.*?https://www.lingdianksw8.com.*?天才一秒记住本站地址:www.lingdianksw8.com。零点看书手机版阅读网址:.*?.com)",
            content, re.S)[0]
        log(g)
        content = content.replace(g, "")
    except:
        Log_text("error", "Failed to strip the embedded ad! Chapter " + title + " (" + url + ")")
    log(content)
    return title, content


def docToMd(name, title, content):
    with open(name + ".md", "w+", encoding="utf-8") as f:
        f.write("## " + title + "\n" + content)
    return 0


# Pool worker - fetch one chapter by its link
def thead_getContent(link):
    Log_text("info", "Trying to fetch " + str(link))
    title, content = getContent(str(link))  # extract title and body from the chapter page
    Log_text("success", "Fetched chapter " + title)
    docToMd(title, title, content)
    Log_text("success", "Wrote chapter " + title)


# Put it all together
def run(url):
    # Truncate the secondary log file
    with open("log1.log", "w+", encoding="utf-8") as f:
        f.write("")
    Log_text("info", "Fetching the novel's index page...")
    code = getCode(url)
    Log_text("success", "Index page fetched; parsing...")
    index = getDict(code)  # [{"title": ..., "link": ...}]
    links = []
    # lineCount caps how many chapters to scrape
    lineCount = 0
    for i in index:
        if lineCount > 10:
            break
        lineCount += 1
        links.append(i["link"])
    print("Link status")
    print(type(links))
    print(links)
    Log_text("success", "Index parsed; fetching chapter bodies...")
    pool = Pool(50)  # worker pool
    pool.map(thead_getContent, links)


if __name__ == '__main__':
    start = datetime.datetime.today()
    Log_text("info", "=== [multithreaded] starting a new test =|=|=|= " + str(start))
    run(r"https://www.lingdianksw8.com/31/31596")
    # getContent("http://www.lingdianksw8.com/31/31596/8403973.html")
    end = datetime.datetime.today()
    Log_text("info", "=== [multithreaded] test finished =|=|=|= " + str(end))
    Log_text("info", "=== [multithreaded] test finished =|=|=|= elapsed " + str(end - start))
    print("")

run02.py

# -*- coding: UTF-8 -*-
# Author: 萌狼蓝天
# Blog: https://mllt.cc
# Notes: https://cnblogs.com/mllt
# Bilibili / WeChat Official Account: 萌狼蓝天
# Written: 2022/9/28
# https://www.lingdianksw8.com/31/31596/
"""
1. Use run01 to collect the chapter links and store them in Redis.
2. Pop the chapter links from Redis and scrape them one by one.
"""
import datetime
import re

import pymongo
import redis
from lxml import html

import run01 as xrilang

client = redis.StrictRedis()


def getLinks():
    xrilang.Log_text("info", "=== Collecting chapter titles and links")
    code = xrilang.getCode("https://www.lingdianksw8.com/61153/61153348/", "get")
    # The pattern matches the site's Chinese chapter-list markup
    source = re.findall("正文卷</dt>(.*?)</dl>", code, re.S)[0]
    selector = html.fromstring(source)
    title_list = selector.xpath("//dd/a/text()")
    url_list = selector.xpath("//dd/a/@href")
    client.flushall()  # wipe Redis so reruns don't duplicate data
    xrilang.Log_text("info", "=== Queueing titles")
    for title in title_list:
        xrilang.log(title)
        client.lpush('title_queue', title)
    xrilang.Log_text("info", "=== Queueing chapter links")
    for url in url_list:
        xrilang.log(url)
        client.lpush('url_queue', url)
    xrilang.log(client.llen('url_queue'))
    xrilang.Log_text("info", "=== Done queueing chapter links, " + str(client.llen('url_queue')) + " in total")


def getContent():
    xrilang.Log_text("info", "=== Fetching chapter bodies")
    database = pymongo.MongoClient()['book']
    collection = database['myWifeSoBeautifull']
    startTime = datetime.datetime.today()
    xrilang.log("Started " + str(startTime))
    linkCount = 0
    datas = []
    while client.llen("url_queue") > 0:
        # scrape 101 chapters (note: this guard actually stops after 11 links)
        if linkCount > 10:
            break
        linkCount += 1
        url = client.lpop("url_queue").decode()
        title = client.lpop("title_queue").decode()
        xrilang.log(url)
        # Fetch the chapter body; the whole batch is saved to MongoDB below
        content_url = "https://www.lingdianksw8.com" + url
        name, content = xrilang.getContent(content_url)
        if name != None and content != None:
            datas.append({"title": title, "name": name, "content": content})
    if datas:  # insert_many raises on an empty list
        collection.insert_many(datas)


if __name__ == '__main__':
    start = datetime.datetime.today()
    xrilang.Log_text("info", "=== [redis+MongoDB, no threading] starting a new test =|=|=|= " + str(start))
    getLinks()
    getContent()
    end = datetime.datetime.today()
    xrilang.Log_text("info", "=== [redis+MongoDB, no threading] test finished =|=|=|= " + str(end))
    xrilang.Log_text("info", "=== [redis+MongoDB, no threading] test finished =|=|=|= elapsed " + str(end - start))
    print("")

run03.py

# -*- coding: UTF-8 -*-
# Author: 萌狼蓝天
# Blog: https://mllt.cc
# Notes: https://cnblogs.com/mllt
# Bilibili / WeChat Official Account: 萌狼蓝天
# Written: 2022/9/28
# https://www.lingdianksw8.com/31/31596/
"""
1. Use run01 to collect the chapter links and store them in Redis.
2. Pop the chapter links from Redis and scrape them with a thread pool.
"""
import datetime
import re
import time
from multiprocessing.dummy import Pool

import pymongo
import redis
from lxml import html

import run01 as xrilang

client = redis.StrictRedis()
database = pymongo.MongoClient()['book']
collection = database['myWifeSoBeautifull']


def getLinks():
    xrilang.Log_text("info", "=== Collecting chapter titles and links")
    code = xrilang.getCode("https://www.lingdianksw8.com/61153/61153348/", "get")
    # The pattern matches the site's Chinese chapter-list markup
    source = re.findall("正文卷</dt>(.*?)</dl>", code, re.S)[0]
    selector = html.fromstring(source)
    url_list = selector.xpath("//dd/a/@href")
    client.flushall()  # wipe Redis so reruns don't duplicate data
    xrilang.Log_text("info", "=== Queueing chapter links")
    i = 0
    for url in url_list:
        xrilang.log(url)
        client.lpush('url_queue', url)
        i += 1
        client.lpush('sort_queue', i)  # sequence number, so chapters can be re-ordered after concurrent scraping
    xrilang.log(client.llen('url_queue'))
    xrilang.Log_text("info", "=== Done queueing chapter links, " + str(client.llen('url_queue')) + " in total")


def getContent(durl):
    url = durl["url"]
    isort = durl["isort"]
    content_url = "https://www.lingdianksw8.com" + url
    title, content = xrilang.getContent(content_url)
    if title != "InternetError":
        if title != None and content != None:
            xrilang.log("Fetched " + title + " successfully")
            collection.insert_one({"isort": isort, "title": title, "content": content})
        else:
            # Failed chapters go back into Redis for the next run
            client.lpush('url_queue', url)
            client.lpush('sort_queue', isort)  # keep the sequence number with the link
            time.sleep(5)  # wait 5 seconds (time.sleep takes seconds, not milliseconds)
    else:
        # Failed chapters go back into Redis for the next run
        client.lpush('url_queue', url)
        client.lpush('sort_queue', isort)  # keep the sequence number with the link
        time.sleep(5)  # wait 5 seconds


def StartGetContent():
    xrilang.Log_text("info", "=== Fetching chapter bodies")
    startTime = datetime.datetime.today()
    xrilang.log("Started " + str(startTime))
    urls = []
    # xrilang.log(client.llen("url_queue"))
    while client.llen("url_queue") > 0:
        url = client.lpop("url_queue").decode()
        isort = client.lpop("sort_queue").decode()
        # urls.append(url)
        urls.append({"url": url, "isort": isort})
    # xrilang.log(urls)
    pool = Pool(500)  # thread pool
    pool.map(getContent, urls)
    endTime = datetime.datetime.today()
    xrilang.log("[Done] " + str(endTime))
    xrilang.Log_text("info", "=== Fetching chapter bodies finished, elapsed " + str(endTime - startTime))


if __name__ == '__main__':
    start = datetime.datetime.today()
    xrilang.Log_text("info", "=== [redis+MongoDB+threads] starting a new test =|=|=|= " + str(start))
    getLinks()
    StartGetContent()
    end = datetime.datetime.today()
    xrilang.Log_text("info", "=== [redis+MongoDB+threads] test finished =|=|=|= " + str(end))
    xrilang.Log_text("info", "=== [redis+MongoDB+threads] test finished =|=|=|= elapsed " + str(end - start))
    print("")

mongoQ.py

# -*- coding: UTF-8 -*-
# Author: 萌狼蓝天
# Blog: https://mllt.cc
# Notes: https://cnblogs.com/mllt
# Bilibili / WeChat Official Account: 萌狼蓝天
# Written: 2022/10/20
import pymongo

database = pymongo.MongoClient()['book']
collection = database['myWifeSoBeautifull']
# isort was stored as a string, so a collation with numericOrdering=True is needed
# to sort "10" after "9" instead of lexicographically.
result = collection.find().collation({"locale": "zh", "numericOrdering": True}).sort("isort")
with open("list.txt", "a+", encoding="utf-8") as f:
    for i in result:
        f.write(i["isort"] + " " + i["title"] + "\n")

Code (2022-10-19)

run01.py

# -*- coding: UTF-8 -*-
# Author: 萌狼蓝天
# Blog: https://mllt.cc
# Notes: https://cnblogs.com/mllt
# Bilibili / WeChat Official Account: 萌狼蓝天
# Written: 2022/9/28
# https://www.lingdianksw8.com/31/31596/
import datetime
import os
import random
import re
from multiprocessing import Pool

import bs4
import requests

os.environ['NO_PROXY'] = "www.lingdianksw8.com"


# Append a line to the log file
def Log_text(lx="info", *text):
    lx = lx.upper()
    with open("log.log", "a+", encoding="utf-8") as f:
        f.write("[" + str(datetime.datetime.now()) + "]" + "[" + lx + "]")
        for i in text:
            f.write(i)


# Debug output to the console
def log(message, i="info"):
    if isinstance(message, str):
        i = i.upper()
        print("[", i, "] [", str(type(message)), "]", message)
    elif isinstance(message, list):
        count = 0
        for j in message:
            print("[", i, "] [", str(count), "] [", str(type(message)), "]", j)
            count += 1
    else:
        print("[", i, "] [", str(type(message)), "]", end=" ")
        print(message)


# Fetch page source
def getCode(url, methods="post"):
    """
    Fetch a page's source.
    :param methods: HTTP method to use
    :param url: page URL (book index page or chapter page)
    :return: page source
    """
    # Build request headers with a random User-Agent
    user_agent = [
  46. "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; AcooBrowser; .NET CLR 1.1.4322; .NET CLR 2.0.50727)",
  47. "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.0; Acoo Browser; SLCC1; .NET CLR 2.0.50727; Media Center PC 5.0; .NET CLR 3.0.04506)",
  48. "Mozilla/4.0 (compatible; MSIE 7.0; AOL 9.5; AOLBuild 4337.35; Windows NT 5.1; .NET CLR 1.1.4322; .NET CLR 2.0.50727)",
  49. "Mozilla/5.0 (Windows; U; MSIE 9.0; Windows NT 9.0; en-US)",
  50. "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Win64; x64; Trident/5.0; .NET CLR 3.5.30729; .NET CLR 3.0.30729; .NET CLR 2.0.50727; Media Center PC 6.0)",
  51. "Mozilla/5.0 (compatible; MSIE 8.0; Windows NT 6.0; Trident/4.0; WOW64; Trident/4.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; .NET CLR 1.0.3705; .NET CLR 1.1.4322)",
  52. "Mozilla/4.0 (compatible; MSIE 7.0b; Windows NT 5.2; .NET CLR 1.1.4322; .NET CLR 2.0.50727; InfoPath.2; .NET CLR 3.0.04506.30)",
  53. "Mozilla/5.0 (Windows; U; Windows NT 5.1; zh-CN) AppleWebKit/523.15 (KHTML, like Gecko, Safari/419.3) Arora/0.3 (Change: 287 c9dfb30)",
  54. "Mozilla/5.0 (X11; U; Linux; en-US) AppleWebKit/527+ (KHTML, like Gecko, Safari/419.3) Arora/0.6",
  55. "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.2pre) Gecko/20070215 K-Ninja/2.1.1",
  56. "Mozilla/5.0 (Windows; U; Windows NT 5.1; zh-CN; rv:1.9) Gecko/20080705 Firefox/3.0 Kapiko/3.0",
  57. "Mozilla/5.0 (X11; Linux i686; U;) Gecko/20070322 Kazehakase/0.4.5",
  58. "Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.0.8) Gecko Fedora/1.9.0.8-1.fc10 Kazehakase/0.5.6",
  59. "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.11 (KHTML, like Gecko) Chrome/17.0.963.56 Safari/535.11",
  60. "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_3) AppleWebKit/535.20 (KHTML, like Gecko) Chrome/19.0.1036.7 Safari/535.20",
  61. "Opera/9.80 (Macintosh; Intel Mac OS X 10.6.8; U; fr) Presto/2.9.168 Version/11.52",
  62. "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.11 (KHTML, like Gecko) Chrome/20.0.1132.11 TaoBrowser/2.0 Safari/536.11",
  63. "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/21.0.1180.71 Safari/537.1 LBBROWSER",
  64. "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; WOW64; Trident/5.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; Media Center PC 6.0; .NET4.0C; .NET4.0E; LBBROWSER)",
  65. "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; QQDownload 732; .NET4.0C; .NET4.0E; LBBROWSER)",
  66. "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.11 (KHTML, like Gecko) Chrome/17.0.963.84 Safari/535.11 LBBROWSER",
  67. "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.1; WOW64; Trident/5.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; Media Center PC 6.0; .NET4.0C; .NET4.0E)",
  68. "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; WOW64; Trident/5.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; Media Center PC 6.0; .NET4.0C; .NET4.0E; QQBrowser/7.0.3698.400)",
  69. "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; QQDownload 732; .NET4.0C; .NET4.0E)",
  70. "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; Trident/4.0; SV1; QQDownload 732; .NET4.0C; .NET4.0E; 360SE)",
  71. "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; QQDownload 732; .NET4.0C; .NET4.0E)",
  72. "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.1; WOW64; Trident/5.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; Media Center PC 6.0; .NET4.0C; .NET4.0E)",
  73. "Mozilla/5.0 (Windows NT 5.1) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/21.0.1180.89 Safari/537.1",
  74. "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/21.0.1180.89 Safari/537.1",
  75. "Mozilla/5.0 (iPad; U; CPU OS 4_2_1 like Mac OS X; zh-cn) AppleWebKit/533.17.9 (KHTML, like Gecko) Version/5.0.2 Mobile/8C148 Safari/6533.18.5",
  76. "Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:2.0b13pre) Gecko/20110307 Firefox/4.0b13pre",
  77. "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:16.0) Gecko/20100101 Firefox/16.0",
  78. "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.64 Safari/537.11",
  79. "Mozilla/5.0 (X11; U; Linux x86_64; zh-CN; rv:1.9.2.10) Gecko/20100922 Ubuntu/10.10 (maverick) Firefox/3.6.10"
  80. ]
    headers = {
        'User-Agent': random.choice(user_agent),
        # "user-agent": user_agent[random.randint(0, len(user_agent) - 1)]
    }
    # Fetch the page
    result = requests.request(methods, url, headers=headers, allow_redirects=True)
    log("cookie" + str(result.cookies.values()))
    tag = 0
    log("Initial page encoding: " + result.encoding)
    if result.encoding == "gbk" or result.encoding == "ISO-8859-1":
        log("Initial encoding is not UTF-8; re-encoding is required", "warn")
        tag = 1
    try:
        result = requests.request(methods, url, headers=headers, allow_redirects=True, cookies=result.cookies)
    except:
        # NOTE: this returns a tuple while the success path returns a string;
        # getContent() then hits the AttributeError documented below.
        return "InternetError", ""
    result_text = result.text
    # print(result_text)
    if tag == 1:
        result_text = recoding(result)
        log("Re-encoding done; text decoded as gbk")
    return result_text


def recoding(result):
    # Decode the raw bytes as GBK; fall back to GB18030 (a superset of GBK),
    # then to requests' own guess.
    try:
        result_text = result.content.decode("gbk")
    except UnicodeDecodeError:
        # e.g. UnicodeDecodeError: 'gbk' codec can't decode byte 0xae in position 6917
        try:
            result_text = result.content.decode("gb18030", errors='ignore')
        except Exception:
            result_text = result.text
    return result_text


# Parse the index page
def getDict(code):
    """
    Parse the page source and return the chapter list as a list of dicts.
    :param code: page source
    :return: list of {"title": ..., "link": ...}
    """
    # Narrow the search down with a regex
    # (the pattern matches the site's Chinese chapter-list markup, so it stays in Chinese)
    code = re.findall("正文卷</dt>(.*?)</dl>", code, re.S)[0]
    # log(code)
    # obj = bs4.BeautifulSoup(markup=code, features="html.parser")
    obj = bs4.BeautifulSoup(markup=code, features="lxml")
    # find_all returns a list of <a> tags
    tag = obj.find_all("a")
    log("Number of chapter links: " + str(len(tag)))
    result = []
    for i in range(len(tag)):
        link = tag[i]["href"]
        text = tag[i].get_text()
        result.append({"title": text, "link": "https://www.lingdianksw8.com" + link})
    return result


# Fetch a chapter body
def getContent(url):
    code = getCode(url, "get")
    try:
        code = code.replace("<br />", "\n")
        code = code.replace("&nbsp;", " ")
        code = code.replace("\u3000", " ")  # full-width spaces used for paragraph indentation
    except Exception as e:
        # AttributeError: 'tuple' object has no attribute 'replace'
        Log_text("error", "[run01-161~163]" + str(e))
    # with open("temp.txt", "w+", encoding="utf-8") as f:
    #     f.write(code)
    obj = bs4.BeautifulSoup(markup=code, features="lxml")
    title = obj.find_all("h1")[0].text
    try:
        content = obj.find_all("div", attrs={"class": "showtxt"})[0].text
    except:
        return None, None
    # with open("temp.txt", "w+", encoding="utf-8") as f:
    #     f.write(content)
    # log(content)
    try:
        # The pattern matches the site's embedded Chinese ad blurb, so it stays in Chinese
        g = re.findall(
            "(:.*?https://www.lingdianksw8.com.*?天才一秒记住本站地址:www.lingdianksw8.com。零点看书手机版阅读网址:.*?.com)",
            content, re.S)[0]
        log(g)
        content = content.replace(g, "")
    except:
        Log_text("error", "Failed to strip the embedded ad! Chapter " + title + " (" + url + ")")
    log(content)
    return title, content


def docToMd(name, title, content):
    with open(name + ".md", "w+", encoding="utf-8") as f:
        f.write("## " + title + "\n" + content)
    return 0


# Pool worker - fetch one chapter by its link
def thead_getContent(link):
    Log_text("info", "Trying to fetch " + str(link))
    title, content = getContent(str(link))  # extract title and body from the chapter page
    Log_text("success", "Fetched chapter " + title)
    docToMd(title, title, content)
    Log_text("success", "Wrote chapter " + title)


# Put it all together
def run(url):
    # Truncate the secondary log file
    with open("log1.log", "w+", encoding="utf-8") as f:
        f.write("")
    Log_text("info", "Fetching the novel's index page...")
    code = getCode(url)
    Log_text("success", "Index page fetched; parsing...")
    index = getDict(code)  # [{"title": ..., "link": ...}]
    links = []
    # lineCount caps how many chapters to scrape (101 here)
    lineCount = 0
    for i in index:
        if lineCount > 100:
            break
        lineCount += 1
        links.append(i["link"])
    print("Link status")
    print(type(links))
    print(links)
    Log_text("success", "Index parsed; fetching chapter bodies...")
    pool = Pool(50)  # worker pool
    pool.map(thead_getContent, links)


if __name__ == '__main__':
    start = datetime.datetime.today()
    Log_text("info", "=== [multithreaded] starting a new test =|=|=|= " + str(start))
    run(r"https://www.lingdianksw8.com/31/31596")
    # getContent("http://www.lingdianksw8.com/31/31596/8403973.html")
    end = datetime.datetime.today()
    Log_text("info", "=== [multithreaded] test finished =|=|=|= " + str(end))
    Log_text("info", "=== [multithreaded] test finished =|=|=|= elapsed " + str(end - start))
    print("")

run02.py

# -*- coding: UTF-8 -*-
# Author: 萌狼蓝天
# Blog: https://mllt.cc
# Notes: https://cnblogs.com/mllt
# Bilibili / WeChat Official Account: 萌狼蓝天
# Written: 2022/9/28
# https://www.lingdianksw8.com/31/31596/
"""
1. Use run01 to collect the chapter links and store them in Redis.
2. Pop the chapter links from Redis and scrape them one by one.
"""
import datetime
import re

import pymongo
import redis
from lxml import html

import run01 as xrilang

client = redis.StrictRedis()


def getLinks():
    xrilang.Log_text("info", "=== Collecting chapter titles and links")
    code = xrilang.getCode("https://www.lingdianksw8.com/61153/61153348/", "get")
    # The pattern matches the site's Chinese chapter-list markup
    source = re.findall("正文卷</dt>(.*?)</dl>", code, re.S)[0]
    selector = html.fromstring(source)
    title_list = selector.xpath("//dd/a/text()")
    url_list = selector.xpath("//dd/a/@href")
    client.flushall()  # wipe Redis so reruns don't duplicate data
    xrilang.Log_text("info", "=== Queueing titles")
    for title in title_list:
        xrilang.log(title)
        client.lpush('title_queue', title)
    xrilang.Log_text("info", "=== Queueing chapter links")
    for url in url_list:
        xrilang.log(url)
        client.lpush('url_queue', url)
    xrilang.log(client.llen('url_queue'))
    xrilang.Log_text("info", "=== Done queueing chapter links, " + str(client.llen('url_queue')) + " in total")


def getContent():
    xrilang.Log_text("info", "=== Fetching chapter bodies")
    database = pymongo.MongoClient()['book']
    collection = database['myWifeSoBeautifull']
    startTime = datetime.datetime.today()
    xrilang.log("Started " + str(startTime))
    linkCount = 0
    datas = []
    while client.llen("url_queue") > 0:
        # scrape 101 chapters (note: this guard actually stops after 11 links)
        if linkCount > 10:
            break
        linkCount += 1
        url = client.lpop("url_queue").decode()
        title = client.lpop("title_queue").decode()
        xrilang.log(url)
        # Fetch the chapter body; the whole batch is saved to MongoDB below
        content_url = "https://www.lingdianksw8.com" + url
        name, content = xrilang.getContent(content_url)
        if name != None and content != None:
            datas.append({"title": title, "name": name, "content": content})
    if datas:  # insert_many raises on an empty list
        collection.insert_many(datas)


if __name__ == '__main__':
    start = datetime.datetime.today()
    xrilang.Log_text("info", "=== [redis+MongoDB, no threading] starting a new test =|=|=|= " + str(start))
    getLinks()
    getContent()
    end = datetime.datetime.today()
    xrilang.Log_text("info", "=== [redis+MongoDB, no threading] test finished =|=|=|= " + str(end))
    xrilang.Log_text("info", "=== [redis+MongoDB, no threading] test finished =|=|=|= elapsed " + str(end - start))
    print("")

run03.py

# -*- coding: UTF-8 -*-
# Author: 萌狼蓝天
# Blog: https://mllt.cc
# Notes: https://cnblogs.com/mllt
# Bilibili / WeChat Official Account: 萌狼蓝天
# Written: 2022/9/28
# https://www.lingdianksw8.com/31/31596/
"""
1. Use run01 to collect the chapter links and store them in Redis.
2. Pop the chapter links from Redis and scrape them with a thread pool.
"""
import datetime
import re
import time
from multiprocessing.dummy import Pool

import pymongo
import redis
from lxml import html

import run01 as xrilang

client = redis.StrictRedis()
database = pymongo.MongoClient()['book']
collection = database['myWifeSoBeautifull']


def getLinks():
    xrilang.Log_text("info", "=== Collecting chapter titles and links")
    code = xrilang.getCode("https://www.lingdianksw8.com/61153/61153348/", "get")
    # The pattern matches the site's Chinese chapter-list markup
    source = re.findall("正文卷</dt>(.*?)</dl>", code, re.S)[0]
    selector = html.fromstring(source)
    url_list = selector.xpath("//dd/a/@href")
    client.flushall()  # wipe Redis so reruns don't duplicate data
    xrilang.Log_text("info", "=== Queueing chapter links")
    i = 0
    for url in url_list:
        xrilang.log(url)
        client.lpush('url_queue', url)
        i += 1
        client.lpush('sort_queue', i)  # sequence number, so chapters can be re-ordered after concurrent scraping
    xrilang.log(client.llen('url_queue'))
    xrilang.Log_text("info", "=== Done queueing chapter links, " + str(client.llen('url_queue')) + " in total")


def getContent(durl):
    url = durl["url"]
    isort = durl["isort"]
    content_url = "https://www.lingdianksw8.com" + url
    title, content = xrilang.getContent(content_url)
    if title != None and content != None:
        if title != "InternetError":
            xrilang.log("Fetched " + title + " successfully")
            collection.insert_one({"isort": isort, "title": title, "content": content})
        else:
            # Failed chapters go back into Redis for the next run
            client.lpush('url_queue', url)
            client.lpush('sort_queue', isort)  # keep the sequence number with the link
            time.sleep(5)  # wait 5 seconds (time.sleep takes seconds, not milliseconds)


def StartGetContent():
    xrilang.Log_text("info", "=== Fetching chapter bodies")
    startTime = datetime.datetime.today()
    xrilang.log("Started " + str(startTime))
    urls = []
    # xrilang.log(client.llen("url_queue"))
    while client.llen("url_queue") > 0:
        url = client.lpop("url_queue").decode()
        isort = client.lpop("sort_queue").decode()
        # urls.append(url)
        urls.append({"url": url, "isort": isort})
    # xrilang.log(urls)
    pool = Pool(500)  # thread pool
    pool.map(getContent, urls)
    endTime = datetime.datetime.today()
    xrilang.log("[Done] " + str(endTime))
    xrilang.Log_text("info", "=== Fetching chapter bodies finished, elapsed " + str(endTime - startTime))


if __name__ == '__main__':
    start = datetime.datetime.today()
    xrilang.Log_text("info", "=== [redis+MongoDB+threads] starting a new test =|=|=|= " + str(start))
    getLinks()
    StartGetContent()
    end = datetime.datetime.today()
    xrilang.Log_text("info", "=== [redis+MongoDB+threads] test finished =|=|=|= " + str(end))
    xrilang.Log_text("info", "=== [redis+MongoDB+threads] test finished =|=|=|= elapsed " + str(end - start))
    print("")