Obtaining the URL and Headers

This comic crawler is based on an article I came across online yesterday, 【python爬虫】动漫之家漫画下载(scrapy). I borrowed its approach, rewrote it with my own ideas and code, and optimized it a little.

The URL and request headers used for crawling the site can be found in the article above.

Analyzing the Page Source

Search page

Go to the site's search page and run a few searches, and the URL pattern becomes obvious: https://m.dmzj.com/search/[keyword].html?. Looking at the search page's source, you can see that all of the results are stored in the serchArry variable (sic) inside a <script> tag, as shown below:

[Figure 01: the serchArry variable inside a <script> tag of the search page source]

Now we can use the BeautifulSoup library and a regular expression to extract the comic id and name:

soup = BeautifulSoup(response.body, 'html.parser')  # parse the search page source
tag = soup.find_all('script')  # get all <script> tags
text = tag[9].string  # the tag that holds the search results
rule = re.compile(r"serchArry=(.*?)\n")
content = json.loads(rule.search(text).group(1))  # extract the search results as JSON

Now content holds all of the search results: content[0]['id'] is the id of the first comic in the results, content[0]['name'] is its name, and so on. Given the pattern of comic URLs, the page of a comic is easy to build: comic_url = 'https://m.dmzj.com/info/%s.html' % content[0]['id']
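
If you want to verify this step outside of Scrapy first, here is a minimal standalone sketch (using requests, with a made-up search keyword, and assuming the site still serves the same markup and accepts a plain browser User-Agent):

import re
import json
import requests
from bs4 import BeautifulSoup

# Standalone check of the search-page parsing; the headers, keyword and
# fixed script index are assumptions carried over from the analysis above.
keyword = '海贼王'
headers = {'user-agent': 'Mozilla/5.0'}
resp = requests.get('https://m.dmzj.com/search/%s.html?' % keyword, headers=headers)
soup = BeautifulSoup(resp.content, 'html.parser')
text = soup.find_all('script')[9].string
content = json.loads(re.search(r"serchArry=(.*?)\n", text).group(1))
for comic in content:
    print(comic['id'], comic['name'], 'https://m.dmzj.com/info/%s.html' % comic['id'])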

Comic main page

The comic's main page stores its data in much the same way as the search page, so here is the code directly:

soup = BeautifulSoup(response.body, 'html.parser')  # parse the comic main page source
tag = soup.find_all('script')  # get all <script> tags
text = tag[10].string  # the tag that holds the chapter list
section_rule = re.compile(r'"data":(.*?}])}')
section = json.loads(section_rule.search(text).group(1))  # extract the chapter list as JSON

Now section holds all of the chapters in order: section[0]['comic_id'] is the comic id and section[0]['id'] is that chapter's id. Following the URL pattern of chapter pages, the URL of every chapter is easy to build as well:

for i in section:
    section_url = 'https://m.dmzj.com/view/%s/%s.html' % (i['comic_id'], i['id'])

Comic content page

The data here is stored almost exactly the same way as on the previous two pages, so the code is quick to write:

soup = BeautifulSoup(response.body, 'html.parser')  # parse the chapter page source
title = soup.find('a', class_='BarTit').string  # get the chapter title
tag = soup.find_all('script')  # get all <script> tags
text = tag[14].string  # the tag that holds the image list
pic_rule = re.compile(r'.*?"page_url":(.*?),"chapter_type".*')
pic_urls = json.loads(pic_rule.search(text).group(1))  # extract the image URLs as JSON

Now title holds the chapter name, which can be used to name a folder, and pic_urls holds the URLs of all the images.
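
One thing to keep in mind (not handled in the original code): chapter titles can contain characters that are illegal in Windows folder names. A small, hypothetical helper like the one below could be used to sanitize the title before creating the directory:

import re

def safe_name(name):
    # safe_name is a hypothetical helper, not part of the original spider.
    # It replaces characters that are not allowed in Windows file/folder names.
    return re.sub(r'[\\/:*?"<>|]', '_', name).strip()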

With that, all of the preparatory work is basically done.
Now let's write the code:

Scrapy spider code

From the command line, use the startproject command to create a Scrapy project:
...\>scrapy startproject dmzj
Then, from the project root, create a spider:
...\dmzj>scrapy genspider spider m.dmzj.com
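
For reference, the generated project should look roughly like this (the exact layout may differ slightly between Scrapy versions):

dmzj/
    scrapy.cfg
    dmzj/
        __init__.py
        items.py
        middlewares.py
        pipelines.py
        settings.py
        spiders/
            __init__.py
            spider.py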

spider.py

# -*- coding: utf-8 -*-
import scrapy
import re
import json
from bs4 import BeautifulSoup
from dmzj.items import DmzjItem


class SpiderSpider(scrapy.Spider):
    name = 'spider'
    # allowed_domains = ['m.dmzj.com']
    start_urls = ['http://m.dmzj.com/']

    def __init__(self, name):
        self.name = name  # the comic name passed in with -a name=...

    def start_requests(self):
        self.url = 'https://m.dmzj.com/search/%s.html?' % self.name
        yield scrapy.http.Request(url=self.url, callback=self.parse_comic)

    def parse_comic(self, response):
        soup = BeautifulSoup(response.body, 'html.parser')  # parse the search page source
        tag = soup.find_all('script')  # get all <script> tags
        text = tag[9].string  # the tag that holds the search results
        rule = re.compile(r"serchArry=(.*?)\n")
        content = json.loads(rule.search(text).group(1))  # extract the search results as JSON
        self.name = content[0]['name']  # keep the comic name for naming the folder
        comic_url = 'https://m.dmzj.com/info/%s.html' % content[0]['id']  # URL of the comic's main page
        yield scrapy.http.Request(url=comic_url, callback=self.parse_section)

    def parse_section(self, response):
        soup = BeautifulSoup(response.body, 'html.parser')  # parse the comic main page source
        tag = soup.find_all('script')  # get all <script> tags
        text = tag[10].string  # the tag that holds the chapter list
        section_rule = re.compile(r'"data":(.*?}])}')
        section = json.loads(section_rule.search(text).group(1))  # extract the chapter list as JSON
        for i in section:  # build the URL of every chapter
            section_url = 'https://m.dmzj.com/view/%s/%s.html' % (
                i['comic_id'], i['id'])
            yield scrapy.http.Request(url=section_url, callback=self.parse_pic)

    def parse_pic(self, response):
        soup = BeautifulSoup(response.body, 'html.parser')  # parse the chapter page source
        title = soup.find('a', class_='BarTit').string  # get the chapter title
        tag = soup.find_all('script')  # get all <script> tags
        text = tag[14].string  # the tag that holds the image list
        pic_rule = re.compile(r'.*?"page_url":(.*?),"chapter_type".*')
        pic_urls = json.loads(pic_rule.search(text).group(1))  # extract the image URLs as JSON
        item = DmzjItem(name=self.name, title=title, pic_urls=pic_urls)  # build the item and hand it to the pipeline
        yield item
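
A caveat: the fixed indices tag[9], tag[10] and tag[14] depend on the exact page layout and will silently break if the site adds or removes a <script> tag. A more defensive variant (a sketch, not part of the original spider) is to look for the <script> that actually contains the marker string:

def find_script(soup, marker):
    # Return the text of the first <script> tag containing the given marker,
    # e.g. find_script(soup, 'serchArry') or find_script(soup, '"page_url"').
    for tag in soup.find_all('script'):
        if tag.string and marker in tag.string:
            return tag.string
    return None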

settings.py

# -*- coding: utf-8 -*-

# Scrapy settings for dmzj project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
# https://doc.scrapy.org/en/latest/topics/settings.html
# https://doc.scrapy.org/en/latest/topics/downloader-middleware.html
# https://doc.scrapy.org/en/latest/topics/spider-middleware.html

BOT_NAME = 'dmzj'

SPIDER_MODULES = ['dmzj.spiders']
NEWSPIDER_MODULE = 'dmzj.spiders'


# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = 'dmzj (+http://www.yourdomain.com)'

# Obey robots.txt rules
ROBOTSTXT_OBEY = False  # this code is for learning purposes only, so we do not obey robots.txt

# Configure maximum concurrent requests performed by Scrapy (default: 16)
#CONCURRENT_REQUESTS = 32

# Configure a delay for requests for the same website (default: 0)
# See https://doc.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
#DOWNLOAD_DELAY = 3
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16

# Disable cookies (enabled by default)
#COOKIES_ENABLED = False

# Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED = False

# Override the default request headers:
DEFAULT_REQUEST_HEADERS = {  # set our own request headers
    'accept': '*/*',
    'accept-encoding': 'gzip, deflate, br',
    'accept-language': 'zh-CN,zh;q=0.9,en;q=0.8',
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3497.100 Safari/537.36'
}

# Enable or disable spider middlewares
# See https://doc.scrapy.org/en/latest/topics/spider-middleware.html
# SPIDER_MIDDLEWARES = {
# 'dmzj.middlewares.DmzjSpiderMiddleware': 543,
# }

# Enable or disable downloader middlewares
# See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html
# DOWNLOADER_MIDDLEWARES = {
# 'dmzj.middlewares.DmzjDownloaderMiddleware': 543,
# }

# Enable or disable extensions
# See https://doc.scrapy.org/en/latest/topics/extensions.html
# EXTENSIONS = {
# 'scrapy.extensions.telnet.TelnetConsole': None,
# }

# Configure item pipelines
# See https://doc.scrapy.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {  # enable the item pipeline
    'dmzj.pipelines.DmzjPipeline': 300,
}

# Enable and configure the AutoThrottle extension (disabled by default)
# See https://doc.scrapy.org/en/latest/topics/autothrottle.html
#AUTOTHROTTLE_ENABLED = True
# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False

# Enable and configure HTTP caching (disabled by default)
# See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED = True
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = 'httpcache'
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'
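
If the site starts throttling or banning requests, it may help to enable a download delay and auto-throttling; these are standard Scrapy settings that are commented out above (the values below are only a suggestion):

DOWNLOAD_DELAY = 0.5         # wait between requests to the same site
AUTOTHROTTLE_ENABLED = True  # let Scrapy adapt the delay automatically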

items.py

import scrapy


class DmzjItem(scrapy.Item):  # the item type carried between spider and pipeline
    # define the fields for your item here like:
    # name = scrapy.Field()
    name = scrapy.Field()      # the comic name
    title = scrapy.Field()     # the chapter name
    pic_urls = scrapy.Field()  # the image URLs of the chapter
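
An item behaves like a dict, which is how the spider and pipeline use it; for example (with made-up values):

item = DmzjItem(name='海贼王', title='第1话', pic_urls=['https://example.com/001.jpg'])
print(item['name'], item['title'], len(item['pic_urls']))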

pipelines.py

import os
import requests


class DmzjPipeline(object):  # saves the images to disk
    def process_item(self, item, spider):
        headers = {  # custom headers, otherwise the images may not be returned
            'accept': '*/*',
            'accept-encoding': 'gzip, deflate, br',
            'accept-language': 'zh-CN,zh;q=0.9,en;q=0.8',
            'cookie': 'UM_distinctid=16a7ceb8612533-0114c357f3f81b-36664c08-1fa400-16a7ceb8615585; show_tip_1=0; pt_198bb240=uid=DljdidnIfmIFIFZeIIOiMA&nid=0&vid=orRtHwGrG8aH/suqp3EdNQ&vn=3&pvn=1&sact=1556880697211&to_flag=0&pl=J8gHIAMoYA2Eg1lD2m4zWQ*pt*1556880683493; RORZ_7f25_saltkey=fcS2577r; RORZ_7f25_lastvisit=1556890761; RORZ_7f25_sid=Q9c84A; RORZ_7f25_lastact=1556894361%09member.php%09logging',
            'referer': 'https://m.dmzj.com/info/yishijieshushu.html',
            'user-agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3497.100 Safari/537.36'
        }
        # Create the folders: the top folder is named after the comic,
        # the subfolder after the chapter.
        path = './'
        comic_path = os.path.join(path, item['name'])
        if not os.path.exists(comic_path):
            os.mkdir(comic_path)
        section_path = os.path.join(comic_path, item['title'])
        if not os.path.exists(section_path):
            os.mkdir(section_path)

        # Download every image and save it.
        for pic_url in item['pic_urls']:
            capter = pic_url.split('/')[-1]  # file name taken from the image URL
            response = requests.get(pic_url, headers=headers).content  # raw bytes of the image
            with open(os.path.join(section_path, capter), 'wb') as f:
                f.write(response)
            print("Downloaded: " + item['name'] + " " + item['title'] + " " + capter)
        return item
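
A note on this design: requests.get is a blocking call, so every image download stalls Scrapy's event loop. An alternative worth considering is Scrapy's built-in ImagesPipeline, which downloads through Scrapy's own asynchronous downloader. The sketch below is an assumption-laden outline rather than a drop-in replacement: it assumes Pillow is installed, that IMAGES_STORE is set in settings.py, that a plain referer of the mobile site is accepted by the image server, and it uses the file_path signature of recent Scrapy versions.

import os
import scrapy
from scrapy.pipelines.images import ImagesPipeline


class DmzjImagesPipeline(ImagesPipeline):
    # Hypothetical alternative pipeline; enable it in ITEM_PIPELINES and set
    # IMAGES_STORE = './' in settings.py instead of using DmzjPipeline.
    def get_media_requests(self, item, info):
        for url in item['pic_urls']:
            yield scrapy.Request(url,
                                 headers={'referer': 'https://m.dmzj.com/'},
                                 meta={'name': item['name'], 'title': item['title']})

    def file_path(self, request, response=None, info=None, *, item=None):
        # Save as <comic name>/<chapter name>/<original file name>.
        return os.path.join(request.meta['name'], request.meta['title'],
                            request.url.split('/')[-1])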

Finally, create a start.py file in the project root to launch the spider.

from scrapy import cmdline

if __name__ == '__main__':
    name = input("Please enter the comic name: ").strip()
    cmdline.execute(("scrapy crawl spider -a name=%s" % name).split())  # launch the spider
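
Running it is then as simple as (the comic name below is just an example):

...\dmzj>python start.py
Please enter the comic name: 海贼王

Note that because the command string is split on spaces, a comic name containing spaces will not be passed through -a correctly; single-word names work fine.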