Scrapy Crawler Framework Example: Scraping Novels

Requirements

  • 笔趣阁 (xbiquge.la) is used as the example site
  • Use a Spider to crawl the novel 《圣墟》 and save it locally
  • Use a CrawlSpider to crawl the handful of novels recommended at the top left of the home page and save them locally

Creating the Project

  • For installing Python and Scrapy, refer to the earlier post 爬虫框架Scrapy笔记

  • Create the project: scrapy startproject xbiquge

  • Edit the xbiquge/xbiquge/settings.py file

# ...
USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.79 Safari/537.36'
ROBOTSTXT_OBEY = False
DOWNLOAD_DELAY = 3
# ...
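DOWNLOAD_DELAY = 3 already throttles requests to roughly one every three seconds. If you would rather let the delay adapt to server load, Scrapy's built-in AutoThrottle extension can be enabled in the same file; the settings below are only an optional sketch with illustrative values, not part of the original configuration:

# Optional: adaptive throttling (values here are only examples)
AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_START_DELAY = 3
AUTOTHROTTLE_MAX_DELAY = 10
AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0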
  • Edit the xbiquge/xbiquge/items.py file
# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/items.html

import scrapy


class XbiqugeItem(scrapy.Item):
    # define the fields for your item here like:
    # novel name
    noval = scrapy.Field()
    # chapter title
    chapter = scrapy.Field()
    # chapter content
    content = scrapy.Field()
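A quick way to confirm the field names before writing the spider is to instantiate the item in a Python shell; this snippet is only a sanity check, not part of the project files:

from xbiquge.items import XbiqugeItem

# Items accept keyword arguments for their declared fields
item = XbiqugeItem(noval='圣墟', chapter='第一章', content='...')
print(dict(item))  # {'noval': '圣墟', 'chapter': '第一章', 'content': '...'}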
  • Edit xbiquge/xbiquge/pipelines.py to write the scraped content to txt files, separated by novel (one txt file per novel)
# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html


class XbiqugePipeline(object):
    def process_item(self, item, spider):
        # One whole novel goes into a single txt file, so append ('a') each
        # chapter; opening with 'w' would overwrite the earlier chapters
        novel = '{}.txt'.format(item['noval'])
        with open(novel, 'a', encoding='utf-8') as f:
            f.write(item['chapter'] + '\n' + item['content'] + '\n\n')
        # The variant below does not set the file encoding, so the output may
        # be garbled when opened:
        # self.file = open(novel, 'a')
        # self.file.write(item['chapter'] + '\n' + item['content'] + '\n\n')
        # self.file.close()
        # self.file.write(bytes(str(item), encoding='utf-8'))
        return item
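Opening and closing the file once per chapter works, but it pays that cost for every item. A variant that keeps one file handle per novel for the lifetime of the spider, using the pipeline's open_spider/close_spider hooks, could look like the sketch below (same field names as above assumed):

class XbiqugePipeline(object):
    def open_spider(self, spider):
        # One open file handle per novel, created lazily on first use
        self.files = {}

    def process_item(self, item, spider):
        name = item['noval']
        if name not in self.files:
            self.files[name] = open('{}.txt'.format(name), 'a', encoding='utf-8')
        self.files[name].write(item['chapter'] + '\n' + item['content'] + '\n\n')
        return item

    def close_spider(self, spider):
        # Close every handle when the spider finishes
        for f in self.files.values():
            f.close()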
  • Alternatively, edit xbiquge/xbiquge/pipelines.py to write the scraped content to txt files, separated by novel (one directory per novel, one txt file per chapter)
# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html

import os


class XbiqugePipeline(object):
    def process_item(self, item, spider):
        # One directory per novel; each chapter is saved as its own txt file
        curPath = '.'
        tempPath = str(item['noval'])
        targetPath = curPath + os.path.sep + tempPath
        if not os.path.exists(targetPath):
            os.makedirs(targetPath)
        filenamePath = targetPath + os.path.sep + str(item['chapter']) + '.txt'
        with open(filenamePath, 'w', encoding='utf-8') as f:
            f.write(item['content'] + '\n')
        return item
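Because the chapter title is used directly as a file name here, titles containing characters such as ? : or / would make open() fail on some systems. A small hypothetical helper (not in the original code) that strips such characters before building filenamePath:

import re

def safe_filename(name):
    # Replace characters that are illegal on Windows/most file systems
    return re.sub(r'[\\/:*?"<>|]', '_', name)

# e.g. filenamePath = targetPath + os.path.sep + safe_filename(str(item['chapter'])) + '.txt'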
  • Edit the xbiquge/xbiquge/settings.py file to enable XbiqugePipeline
ITEM_PIPELINES = {
    'xbiquge.pipelines.XbiqugePipeline': 300,
}

Spider

Approach:

  • Since the goal is a complete novel, first scrape the link to every chapter from the table-of-contents page, then follow each link to scrape the chapter content.
  • Create a new spider file singleSpider under the xbiquge/xbiquge/spiders/ directory: scrapy genspider singleSpider xbiquge.la

  • Edit the xbiquge/xbiquge/spiders/singleSpider.py file

# -*- coding: utf-8 -*-
import scrapy, re

from xbiquge.items import XbiqugeItem


class SinglespiderSpider(scrapy.Spider):
    name = 'singleSpider'
    allowed_domains = ['xbiquge.la']
    start_urls = ['http://www.xbiquge.la/13/13959/']

    def parse(self, response):
        listMain = response.xpath('//div[@id="list"]/dl/dd')
        for dd in listMain:
            item = XbiqugeItem()
            item['chapter'] = dd.xpath('./a/text()').extract_first()
            item['noval'] = '圣墟'
            partial_url = dd.xpath('./a/@href').extract_first()
            url = response.urljoin(partial_url)
            request = scrapy.Request(url=url, callback=self.parse_body)
            # Request takes a meta parameter that passes arbitrary data (as a
            # dict, e.g. meta={'item': item}) through to the callback.
            # Here it temporarily holds the data parsed from the index page so
            # that it can be merged with the detail-page content and yielded
            # as a single item in parse_body.
            request.meta['item'] = item
            yield request

    # Parse a chapter detail page
    def parse_body(self, response):
        # Retrieve the item that was passed in via meta
        item = response.meta['item']
        # Match the Chinese characters and <br> tags inside the content div;
        # the regex lets us grab exactly the text we want from the markup
        content_list = response.xpath('.//div[@id="content"]').re('([\u4e00-\u9fa5]|<br>)+?')
        content_str = ''.join(content_list)
        # Turn <br> tags into newlines
        content = re.sub(r'<br\/?>', '\n', content_str)
        item['content'] = content
        yield item
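In Scrapy 1.7 and later, cb_kwargs is a slightly cleaner way than request.meta to hand data to a callback. The spider below is only a minimal sketch of that hand-off (spider name, and the simpler text() extraction, are placeholders for illustration, not the original code):

# -*- coding: utf-8 -*-
import scrapy

from xbiquge.items import XbiqugeItem


class CbkwargsSketchSpider(scrapy.Spider):
    name = 'cbkwargsSketch'  # hypothetical name, used only for this sketch
    allowed_domains = ['xbiquge.la']
    start_urls = ['http://www.xbiquge.la/13/13959/']

    def parse(self, response):
        for dd in response.xpath('//div[@id="list"]/dl/dd'):
            item = XbiqugeItem()
            item['noval'] = '圣墟'
            item['chapter'] = dd.xpath('./a/text()').extract_first()
            url = response.urljoin(dd.xpath('./a/@href').extract_first())
            # cb_kwargs entries become keyword arguments of the callback
            yield scrapy.Request(url=url, callback=self.parse_body,
                                 cb_kwargs={'item': item})

    def parse_body(self, response, item):
        # item arrives directly as a keyword argument, no meta lookup needed;
        # join the text nodes of the content div as a simple extraction
        item['content'] = '\n'.join(
            response.xpath('//div[@id="content"]/text()').getall()).strip()
        yield item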
  • Run: scrapy crawl singleSpider
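If you prefer launching the spider from a plain Python script instead of the scrapy CLI, Scrapy's CrawlerProcess can do it. The script name below is hypothetical; it assumes it is run from the project root so the project settings and the xbiquge package can be found:

# run_single.py -- hypothetical helper script, run from the project root
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

from xbiquge.spiders.singleSpider import SinglespiderSpider

process = CrawlerProcess(get_project_settings())
process.crawl(SinglespiderSpider)
process.start()  # blocks until the crawl finishes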

CrawlSpider

  • Create a new spider file crawlSpider under the xbiquge/xbiquge/spiders/ directory: scrapy genspider -t crawl crawlSpider xbiquge.la

  • Edit the xbiquge/xbiquge/spiders/crawlSpider.py file

# -*- coding: utf-8 -*-
import scrapy, re

from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule

from xbiquge.items import XbiqugeItem


class CrawlspiderSpider(CrawlSpider):
    name = 'crawlSpider'
    allowed_domains = ['xbiquge.la']
    start_urls = ['http://www.xbiquge.la/']

    rules = (
        # Only follow the novel links inside the recommended block on the home page.
        # Note: restrict_xpaths should only point at a tag, never at a specific
        # attribute of the tag.
        Rule(LinkExtractor(allow=r'/\d+?/\d+?/',
                           restrict_xpaths='//*[@id="hotcontent"]/div[1]',
                           unique=True),
             callback='parse_item', follow=False),
    )

    def parse_item(self, response):
        listMain = response.xpath('//div[@id="list"]/dl/dd')
        for dd in listMain:
            item = XbiqugeItem()
            item['chapter'] = dd.xpath('./a/text()').extract_first()
            # The novel title comes from the info block of the table-of-contents page
            item['noval'] = response.xpath('//div[@id="info"]/h1/text()').extract_first()
            partial_url = dd.xpath('./a/@href').extract_first()
            url = response.urljoin(partial_url)
            request = scrapy.Request(url=url, callback=self.parse_body)
            request.meta['item'] = item
            yield request

    # Parse a chapter detail page
    def parse_body(self, response):
        # Retrieve the item that was passed in via meta
        item = response.meta['item']
        # Match the Chinese characters and <br> tags inside the content div
        content_list = response.xpath('.//div[@id="content"]').re('([\u4e00-\u9fa5]|<br>)+?')
        content_str = ''.join(content_list)
        # Turn <br> tags into newlines
        content = re.sub(r'<br\/?>', '\n', content_str)
        item['content'] = content
        yield item
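If the rule does not pick up the links you expect, the LinkExtractor can be tested interactively before running the full crawl; a quick check in scrapy shell (XPath and pattern copied from the rule above) might look like this:

scrapy shell 'http://www.xbiquge.la/'

# inside the shell:
from scrapy.linkextractors import LinkExtractor
le = LinkExtractor(allow=r'/\d+?/\d+?/', restrict_xpaths='//*[@id="hotcontent"]/div[1]')
[link.url for link in le.extract_links(response)]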
  • Run: scrapy crawl crawlSpider

Writing these posts takes effort; if this article helped you, a tip to support the author is appreciated!
