Requirements
- Use 笔趣阁 (xbiquge) as the example site
- Use a Spider to crawl the novel 《圣墟》 and save it locally
- Use a CrawlSpider to crawl the novels recommended in the top-left corner of the home page and save them locally
Creating the project
- Install Python and Scrapy (see the earlier notes on the Scrapy crawler framework)
- Create the project: `scrapy startproject xbiquge`
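The command generates the standard Scrapy project skeleton; on a default install the layout looks roughly like this:

```
xbiquge/
├── scrapy.cfg
└── xbiquge/
    ├── __init__.py
    ├── items.py
    ├── middlewares.py
    ├── pipelines.py
    ├── settings.py
    └── spiders/
        └── __init__.py
```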
- Edit the xbiquge/xbiquge/settings.py file
```python
# ...
USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.79 Safari/537.36'
ROBOTSTXT_OBEY = False
DOWNLOAD_DELAY = 3
# ...
```
- Edit the xbiquge/xbiquge/items.py file
```python
# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/items.html

import scrapy


class XbiqugeItem(scrapy.Item):
    # define the fields for your item here like:
    # novel name
    noval = scrapy.Field()
    # chapter title
    chapter = scrapy.Field()
    # chapter content
    content = scrapy.Field()
```
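The item behaves like a dictionary; a minimal usage sketch with made-up values (not part of the project code):

```python
from xbiquge.items import XbiqugeItem

item = XbiqugeItem()
item['noval'] = '圣墟'        # novel name
item['chapter'] = '第一章'    # chapter title (made-up value)
item['content'] = '……'        # chapter body
print(dict(item))             # Items support dict-style access
```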
- Edit xbiquge/xbiquge/pipelines.py to write the scraped content to txt files, split by novel (one txt file per novel)
```python
# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html


class XbiqugePipeline(object):
    def process_item(self, item, spider):
        # Store a whole novel in a single txt file.
        # Open in append mode ('a'); 'w' would overwrite the file on every item
        # and only the last chapter would survive.
        novel = '{}.txt'.format(item['noval'])
        with open(novel, 'a', encoding='utf-8') as f:
            f.write(item['chapter'] + '\n' + item['content'] + '\n\n')
        # The variant below does not set the file encoding, so the output may be garbled when opened:
        # self.file = open(novel, 'a')
        # self.file.write(item['chapter'] + '\n' + item['content'] + '\n\n')
        # self.file.close()
        # self.file.write(bytes(str(item), encoding='utf-8'))
        return item
```
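As an alternative, a pipeline can keep one file handle per novel open for the whole crawl instead of reopening the file for every item. A minimal sketch, assuming the same item fields as above (the class name XbiqugeFilePipeline and the per-novel dictionary are illustrative, not part of the original project):

```python
class XbiqugeFilePipeline(object):
    def open_spider(self, spider):
        self.files = {}  # novel name -> open file handle

    def process_item(self, item, spider):
        name = item['noval']
        if name not in self.files:
            # Open the file once per novel, in append mode with an explicit encoding.
            self.files[name] = open('{}.txt'.format(name), 'a', encoding='utf-8')
        self.files[name].write(item['chapter'] + '\n' + item['content'] + '\n\n')
        return item

    def close_spider(self, spider):
        # Close every handle when the spider finishes.
        for f in self.files.values():
            f.close()
```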
- Edit xbiquge/xbiquge/pipelines.py to write the scraped content to txt files, split by novel (one directory per novel, one txt file per chapter)
```python
# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html
import os


class XbiqugePipeline(object):
    def process_item(self, item, spider):
        # One directory per novel; each chapter is saved as its own txt file.
        curPath = '.'
        tempPath = str(item['noval'])
        targetPath = curPath + os.path.sep + tempPath
        if not os.path.exists(targetPath):
            os.makedirs(targetPath)
        filenamePath = targetPath + os.path.sep + str(item['chapter']) + '.txt'
        with open(filenamePath, 'w', encoding='utf-8') as f:
            f.write(item['content'] + '\n')
        return item
```
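Chapter titles occasionally contain characters that are not valid in file names (e.g. '?' or '/'); a small hypothetical helper like the one below could be applied to item['chapter'] before building filenamePath (the name safe_filename is illustrative, not part of the original code):

```python
import re

def safe_filename(name):
    # Replace characters that are illegal in Windows/Unix file names with '_'.
    return re.sub(r'[\\/:*?"<>|]', '_', name).strip()
```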
- Edit the xbiquge/xbiquge/settings.py file to enable XbiqugePipeline
```python
ITEM_PIPELINES = {
    'xbiquge.pipelines.XbiqugePipeline': 300,
}
```
Spider
Approach:
- To crawl a complete novel, first scrape the link of every chapter from the table-of-contents page, then follow each link and scrape the chapter content from the detail page
- In the xbiquge/xbiquge/spiders/ directory, generate the spider file singleSpider: `scrapy genspider singleSpider xbiquge.la`
- Edit the xbiquge/xbiquge/spiders/singleSpider.py file
```python
# -*- coding: utf-8 -*-
import scrapy, re
from xbiquge.items import XbiqugeItem


class SinglespiderSpider(scrapy.Spider):
    name = 'singleSpider'
    allowed_domains = ['xbiquge.la']
    start_urls = ['http://www.xbiquge.la/13/13959/']

    def parse(self, response):
        listMain = response.xpath('//div[@id="list"]/dl/dd')
        for dd in listMain:
            item = XbiqugeItem()
            item['chapter'] = dd.xpath('./a/text()').extract_first()
            item['noval'] = '圣墟'
            partial_url = dd.xpath('./a/@href').extract_first()
            url = response.urljoin(partial_url)
            request = scrapy.Request(url=url, callback=self.parse_body)
            # Request has a meta parameter that passes arbitrary data to the callback
            # as a dict, e.g. meta={'key': item} (equivalent to request.meta['item'] = item below).
            # Here meta stashes the data collected on the index page so it can be combined
            # with the detail page content and yielded as a single item.
            request.meta['item'] = item
            yield request

    # Parse the chapter detail page.
    def parse_body(self, response):
        # Retrieve the item passed in via meta.
        item = response.meta['item']
        # Use a regex to keep only the Chinese characters and <br> tags inside the content div;
        # .re() matches against everything inside the selected node, so we extract just what we need.
        content_list = response.xpath('.//div[@id="content"]').re('([\u4e00-\u9fa5]|<br>)+?')
        content_str = ''.join(content_list)
        content = re.sub(r'<br\/?>', '\n', content_str)
        item['content'] = content
        yield item
```
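Before running the full crawl, the XPath expressions above can be checked interactively in scrapy shell; a session along these lines (the actual output depends on the live page) helps confirm the selectors:

```python
# Run: scrapy shell http://www.xbiquge.la/13/13959/
>>> response.xpath('//div[@id="list"]/dl/dd/a/text()').extract_first()   # a chapter title
>>> response.xpath('//div[@id="list"]/dl/dd/a/@href').extract_first()    # a chapter link
>>> response.xpath('.//div[@id="content"]').re('([\u4e00-\u9fa5]|<br>)+?')[:20]
```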
- Run: `scrapy crawl singleSpider`
CrawlSpider
- In the xbiquge/xbiquge/spiders/ directory, generate the spider file crawlSpider: `scrapy genspider -t crawl crawlSpider xbiquge.la`
- Edit the xbiquge/xbiquge/spiders/crawlSpider.py file
```python
# -*- coding: utf-8 -*-
import scrapy, re
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
from xbiquge.items import XbiqugeItem


class CrawlspiderSpider(CrawlSpider):
    name = 'crawlSpider'
    allowed_domains = ['xbiquge.la']
    start_urls = ['http://www.xbiquge.la/']

    rules = (
        # Note: restrict_xpaths should point at an element, never at a specific attribute.
        Rule(LinkExtractor(allow=r'/\d+?/\d+?/', restrict_xpaths='//*[@id="hotcontent"]/div[1]', unique=True),
             callback='parse_item', follow=False),
    )

    def parse_item(self, response):
        listMain = response.xpath('//div[@id="list"]/dl/dd')
        for dd in listMain:
            item = XbiqugeItem()
            item['chapter'] = dd.xpath('./a/text()').extract_first()
            item['noval'] = response.xpath('//div[@id="info"]/h1/text()').extract_first()
            partial_url = dd.xpath('./a/@href').extract_first()
            url = response.urljoin(partial_url)
            request = scrapy.Request(url=url, callback=self.parse_body)
            request.meta['item'] = item
            yield request

    # Parse the chapter detail page.
    def parse_body(self, response):
        # Retrieve the item passed in via meta.
        item = response.meta['item']
        # Keep only the Chinese characters and <br> tags inside the content div.
        content_list = response.xpath('.//div[@id="content"]').re('([\u4e00-\u9fa5]|<br>)+?')
        content_str = ''.join(content_list)
        content = re.sub(r'<br\/?>', '\n', content_str)
        item['content'] = content
        yield item
```
- Run: `scrapy crawl crawlSpider`
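The Rule's LinkExtractor can also be tried out in scrapy shell before crawling; a quick check might look like this (the extracted URLs depend on the current home page):

```python
# Run: scrapy shell http://www.xbiquge.la/
>>> from scrapy.linkextractors import LinkExtractor
>>> le = LinkExtractor(allow=r'/\d+?/\d+?/', restrict_xpaths='//*[@id="hotcontent"]/div[1]', unique=True)
>>> [link.url for link in le.extract_links(response)]   # the links the Rule would follow
```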