Obtaining the URL and Headers

This comic crawler is based on an article I came across online yesterday, 【python爬虫】动漫之家漫画下载(scrapy). I borrowed its approach, rewrote it with my own ideas and code, and optimized it a little.

The URL and request headers used for crawling the site can be found in the article above.

Analyzing the Page Source

Search page

Go to the site's search page and run a few searches, and the URL pattern becomes obvious: https://m.dmzj.com/search/[keyword].html?. Looking at the search page's source, you can see that all of the results are stored in the serchArry variable (sic) inside a <script> tag, as shown below:

[Figure 01: the serchArry variable inside a <script> tag of the search page source]

Now we can use the BeautifulSoup library and a regular expression to extract the comic id and name:

soup = BeautifulSoup(response.body, 'html.parser')  # parse the search page source
tag = soup.find_all('script')  # get all <script> tags
text = tag[9].string  # the tag that holds the search results
rule = re.compile(r"serchArry=(.*?)\n")
content = json.loads(rule.search(text).group(1))  # extract the search results as JSON

Now content holds all of the search results: content[0]['id'] is the id of the first comic in the results, content[0]['name'] is its name, and so on. Given the pattern of comic URLs, the page of a comic is easy to build: comic_url = 'https://m.dmzj.com/info/%s.html' % content[0]['id']
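
If you want to verify this step outside of Scrapy first, here is a minimal standalone sketch (using requests, with a made-up search keyword, and assuming the site still serves the same markup and accepts a plain browser User-Agent):

import re
import json
import requests
from bs4 import BeautifulSoup

# Standalone check of the search-page parsing; the headers, keyword and
# fixed script index are assumptions carried over from the analysis above.
keyword = '海贼王'
headers = {'user-agent': 'Mozilla/5.0'}
resp = requests.get('https://m.dmzj.com/search/%s.html?' % keyword, headers=headers)
soup = BeautifulSoup(resp.content, 'html.parser')
text = soup.find_all('script')[9].string
content = json.loads(re.search(r"serchArry=(.*?)\n", text).group(1))
for comic in content:
    print(comic['id'], comic['name'], 'https://m.dmzj.com/info/%s.html' % comic['id'])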

Comic main page

The comic's main page stores its data in much the same way as the search page, so here is the code directly:

soup = BeautifulSoup(response.body, 'html.parser')  # parse the comic main page source
tag = soup.find_all('script')  # get all <script> tags
text = tag[10].string  # the tag that holds the chapter list
section_rule = re.compile(r'"data":(.*?}])}')
section = json.loads(section_rule.search(text).group(1))  # extract the chapter list as JSON

Now section holds all of the chapters in order: section[0]['comic_id'] is the comic id and section[0]['id'] is that chapter's id. Following the URL pattern of chapter pages, the URL of every chapter is easy to build as well:

for i in section:
    section_url = 'https://m.dmzj.com/view/%s/%s.html' % (i['comic_id'], i['id'])

Comic content page

The data here is stored almost exactly the same way as on the previous two pages, so the code is quick to write:

soup = BeautifulSoup(response.body, 'html.parser')  # parse the chapter page source
title = soup.find('a', class_='BarTit').string  # get the chapter title
tag = soup.find_all('script')  # get all <script> tags
text = tag[14].string  # the tag that holds the image list
pic_rule = re.compile(r'.*?"page_url":(.*?),"chapter_type".*')
pic_urls = json.loads(pic_rule.search(text).group(1))  # extract the image URLs as JSON

Now title holds the chapter name, which can be used to name a folder, and pic_urls holds the URLs of all the images.
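
One thing to keep in mind (not handled in the original code): chapter titles can contain characters that are illegal in Windows folder names. A small, hypothetical helper like the one below could be used to sanitize the title before creating the directory:

import re

def safe_name(name):
    # safe_name is a hypothetical helper, not part of the original spider.
    # It replaces characters that are not allowed in Windows file/folder names.
    return re.sub(r'[\\/:*?"<>|]', '_', name).strip()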

With that, all of the preparatory work is basically done.
Now let's write the code:

Scrapy spider code

From the command line, use the startproject command to create a Scrapy project:
...\>scrapy startproject dmzj
Then, from the project root, create a spider:
...\dmzj>scrapy genspider spider m.dmzj.com
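
For reference, the generated project should look roughly like this (the exact layout may differ slightly between Scrapy versions):

dmzj/
    scrapy.cfg
    dmzj/
        __init__.py
        items.py
        middlewares.py
        pipelines.py
        settings.py
        spiders/
            __init__.py
            spider.py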

spider.py

# -*- coding: utf-8 -*-
import scrapy
import re
import json
from bs4 import BeautifulSoup
from dmzj.items import DmzjItem


class SpiderSpider(scrapy.Spider):
    name = 'spider'
    # allowed_domains = ['m.dmzj.com']
    start_urls = ['http://m.dmzj.com/']

    def __init__(self, name):
        self.name = name  # the comic name passed in with -a name=...

    def start_requests(self):
        self.url = 'https://m.dmzj.com/search/%s.html?' % self.name
        yield scrapy.http.Request(url=self.url, callback=self.parse_comic)

    def parse_comic(self, response):
        soup = BeautifulSoup(response.body, 'html.parser')  # parse the search page source
        tag = soup.find_all('script')  # get all <script> tags
        text = tag[9].string  # the tag that holds the search results
        rule = re.compile(r"serchArry=(.*?)\n")
        content = json.loads(rule.search(text).group(1))  # extract the search results as JSON
        self.name = content[0]['name']  # keep the comic name for naming the folder
        comic_url = 'https://m.dmzj.com/info/%s.html' % content[0]['id']  # URL of the comic's main page
        yield scrapy.http.Request(url=comic_url, callback=self.parse_section)

    def parse_section(self, response):
        soup = BeautifulSoup(response.body, 'html.parser')  # parse the comic main page source
        tag = soup.find_all('script')  # get all <script> tags
        text = tag[10].string  # the tag that holds the chapter list
        section_rule = re.compile(r'"data":(.*?}])}')
        section = json.loads(section_rule.search(text).group(1))  # extract the chapter list as JSON
        for i in section:  # build the URL of every chapter
            section_url = 'https://m.dmzj.com/view/%s/%s.html' % (
                i['comic_id'], i['id'])
            yield scrapy.http.Request(url=section_url, callback=self.parse_pic)

    def parse_pic(self, response):
        soup = BeautifulSoup(response.body, 'html.parser')  # parse the chapter page source
        title = soup.find('a', class_='BarTit').string  # get the chapter title
        tag = soup.find_all('script')  # get all <script> tags
        text = tag[14].string  # the tag that holds the image list
        pic_rule = re.compile(r'.*?"page_url":(.*?),"chapter_type".*')
        pic_urls = json.loads(pic_rule.search(text).group(1))  # extract the image URLs as JSON
        item = DmzjItem(name=self.name, title=title, pic_urls=pic_urls)  # build the item and hand it to the pipeline
        yield item
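
A caveat: the fixed indices tag[9], tag[10] and tag[14] depend on the exact page layout and will silently break if the site adds or removes a <script> tag. A more defensive variant (a sketch, not part of the original spider) is to look for the <script> that actually contains the marker string:

def find_script(soup, marker):
    # Return the text of the first <script> tag containing the given marker,
    # e.g. find_script(soup, 'serchArry') or find_script(soup, '"page_url"').
    for tag in soup.find_all('script'):
        if tag.string and marker in tag.string:
            return tag.string
    return None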

settings.py

# -*- coding: utf-8 -*-

# Scrapy settings for dmzj project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
# https://doc.scrapy.org/en/latest/topics/settings.html
# https://doc.scrapy.org/en/latest/topics/downloader-middleware.html
# https://doc.scrapy.org/en/latest/topics/spider-middleware.html

BOT_NAME = 'dmzj'

SPIDER_MODULES = ['dmzj.spiders']
NEWSPIDER_MODULE = 'dmzj.spiders'


# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = 'dmzj (+http://www.yourdomain.com)'

# Obey robots.txt rules
ROBOTSTXT_OBEY = False  # this code is for learning purposes only, so we do not obey robots.txt

# Configure maximum concurrent requests performed by Scrapy (default: 16)
#CONCURRENT_REQUESTS = 32

# Configure a delay for requests for the same website (default: 0)
# See https://doc.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
#DOWNLOAD_DELAY = 3
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16

# Disable cookies (enabled by default)
#COOKIES_ENABLED = False

# Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED = False

# Override the default request headers:
DEFAULT_REQUEST_HEADERS = {  # set our own request headers
    'accept': '*/*',
    'accept-encoding': 'gzip, deflate, br',
    'accept-language': 'zh-CN,zh;q=0.9,en;q=0.8',
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3497.100 Safari/537.36'
}

# Enable or disable spider middlewares
# See https://doc.scrapy.org/en/latest/topics/spider-middleware.html
# SPIDER_MIDDLEWARES = {
# 'dmzj.middlewares.DmzjSpiderMiddleware': 543,
# }

# Enable or disable downloader middlewares
# See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html
# DOWNLOADER_MIDDLEWARES = {
# 'dmzj.middlewares.DmzjDownloaderMiddleware': 543,
# }

# Enable or disable extensions
# See https://doc.scrapy.org/en/latest/topics/extensions.html
# EXTENSIONS = {
# 'scrapy.extensions.telnet.TelnetConsole': None,
# }

# Configure item pipelines
# See https://doc.scrapy.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {  # enable the item pipeline
    'dmzj.pipelines.DmzjPipeline': 300,
}

# Enable and configure the AutoThrottle extension (disabled by default)
# See https://doc.scrapy.org/en/latest/topics/autothrottle.html
#AUTOTHROTTLE_ENABLED = True
# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False

# Enable and configure HTTP caching (disabled by default)
# See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED = True
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = 'httpcache'
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'
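
If the site starts throttling or banning requests, it may help to enable a download delay and auto-throttling; these are standard Scrapy settings that are commented out above (the values below are only a suggestion):

DOWNLOAD_DELAY = 0.5         # wait between requests to the same site
AUTOTHROTTLE_ENABLED = True  # let Scrapy adapt the delay automatically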

items.py

import scrapy


class DmzjItem(scrapy.Item):  # the item type carried between spider and pipeline
    # define the fields for your item here like:
    # name = scrapy.Field()
    name = scrapy.Field()      # the comic name
    title = scrapy.Field()     # the chapter name
    pic_urls = scrapy.Field()  # the image URLs of the chapter
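
An item behaves like a dict, which is how the spider and pipeline use it; for example (with made-up values):

item = DmzjItem(name='海贼王', title='第1话', pic_urls=['https://example.com/001.jpg'])
print(item['name'], item['title'], len(item['pic_urls']))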

pipelines.py

import os
import requests


class DmzjPipeline(object):  # saves the images to disk
    def process_item(self, item, spider):
        headers = {  # custom headers, otherwise the images may not be returned
            'accept': '*/*',
            'accept-encoding': 'gzip, deflate, br',
            'accept-language': 'zh-CN,zh;q=0.9,en;q=0.8',
            'cookie': 'UM_distinctid=16a7ceb8612533-0114c357f3f81b-36664c08-1fa400-16a7ceb8615585; show_tip_1=0; pt_198bb240=uid=DljdidnIfmIFIFZeIIOiMA&nid=0&vid=orRtHwGrG8aH/suqp3EdNQ&vn=3&pvn=1&sact=1556880697211&to_flag=0&pl=J8gHIAMoYA2Eg1lD2m4zWQ*pt*1556880683493; RORZ_7f25_saltkey=fcS2577r; RORZ_7f25_lastvisit=1556890761; RORZ_7f25_sid=Q9c84A; RORZ_7f25_lastact=1556894361%09member.php%09logging',
            'referer': 'https://m.dmzj.com/info/yishijieshushu.html',
            'user-agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3497.100 Safari/537.36'
        }
        # Create the folders: the top folder is named after the comic,
        # the subfolder after the chapter.
        path = './'
        comic_path = os.path.join(path, item['name'])
        if not os.path.exists(comic_path):
            os.mkdir(comic_path)
        section_path = os.path.join(comic_path, item['title'])
        if not os.path.exists(section_path):
            os.mkdir(section_path)

        # Download every image and save it.
        for pic_url in item['pic_urls']:
            capter = pic_url.split('/')[-1]  # file name taken from the image URL
            response = requests.get(pic_url, headers=headers).content  # raw bytes of the image
            with open(os.path.join(section_path, capter), 'wb') as f:
                f.write(response)
            print("Downloaded: " + item['name'] + " " + item['title'] + " " + capter)
        return item
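
A note on this design: requests.get is a blocking call, so every image download stalls Scrapy's event loop. An alternative worth considering is Scrapy's built-in ImagesPipeline, which downloads through Scrapy's own asynchronous downloader. The sketch below is an assumption-laden outline rather than a drop-in replacement: it assumes Pillow is installed, that IMAGES_STORE is set in settings.py, that a plain referer of the mobile site is accepted by the image server, and it uses the file_path signature of recent Scrapy versions.

import os
import scrapy
from scrapy.pipelines.images import ImagesPipeline


class DmzjImagesPipeline(ImagesPipeline):
    # Hypothetical alternative pipeline; enable it in ITEM_PIPELINES and set
    # IMAGES_STORE = './' in settings.py instead of using DmzjPipeline.
    def get_media_requests(self, item, info):
        for url in item['pic_urls']:
            yield scrapy.Request(url,
                                 headers={'referer': 'https://m.dmzj.com/'},
                                 meta={'name': item['name'], 'title': item['title']})

    def file_path(self, request, response=None, info=None, *, item=None):
        # Save as <comic name>/<chapter name>/<original file name>.
        return os.path.join(request.meta['name'], request.meta['title'],
                            request.url.split('/')[-1])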

Finally, create a start.py file in the project root to launch the spider.

from scrapy import cmdline

if __name__ == '__main__':
    name = input("Please enter the comic name: ").strip()
    cmdline.execute(("scrapy crawl spider -a name=%s" % name).split())  # launch the spider
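
Running it is then as simple as (the comic name below is just an example):

...\dmzj>python start.py
Please enter the comic name: 海贼王

Note that because the command string is split on spaces, a comic name containing spaces will not be passed through -a correctly; single-word names work fine.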