
Python Basics: An Introduction to Scrapy

Scrapy is a step beyond basic crawling: it fetches target content concurrently, simplifies code logic, and improves development efficiency, which has made it popular with crawler developers. This article uses a stock-quote website as an example to walk through building a crawler with Scrapy. It is intended for learning and reference only; corrections are welcome.

What is Scrapy?

Scrapy is an application framework written in Python for crawling websites and extracting structured data. It handles network communication with Twisted, an efficient asynchronous networking framework.

The components of the Scrapy architecture are described below:

  • Scrapy Engine: the engine controls the flow of data between all components of the system and triggers events when certain actions occur. It is the "brain" of the crawler and the scheduling center of the whole framework.
  • Scheduler: receives requests from the engine and enqueues them. The initial URLs and the follow-up URLs discovered on crawled pages are placed in the scheduler to wait their turn. The scheduler automatically removes duplicate URLs.
  • Downloader: fetches page data and hands it to the engine, which passes it on to the spiders.
  • Spider: user-written class that parses responses, extracts items, and yields follow-up URLs. Follow-up URLs are handed back to the engine and added to the scheduler. Each spider is responsible for one specific site (or a few sites).
  • Item Pipeline: processes the items extracted by the spiders. Once a page has been parsed and the required data stored in an item, the item is sent through the pipelines in the configured order.
  • Downloader Middlewares: specific hooks between the engine and the downloader that process the requests and responses passing between them. They provide a simple mechanism for extending Scrapy by plugging in custom code, e.g. to rotate the User-Agent or proxy IP automatically.
  • Spider Middlewares: specific hooks between the engine and the spiders that process spider input (responses) and output (items and requests). They provide the same simple mechanism for extending Scrapy by plugging in custom code.

Scrapy data flow:

  1. The engine opens a website, finds the spider that handles it, and asks that spider for the first URL(s) to crawl;
  2. The engine takes the first URL(s) to crawl from the spider and enqueues them in the scheduler as requests;
  3. The engine asks the scheduler for the next URL to crawl;
  4. The scheduler returns the next URL to the engine, and the engine forwards it to the downloader through the downloader middlewares;
  5. Once the page finishes downloading, the downloader generates a response for it and sends the response to the engine through the downloader middlewares;
  6. The engine receives the response from the downloader and sends it through the spider middlewares to the spider for processing;
  7. The spider processes the response and returns the extracted items plus any new requests to the engine;
  8. The engine passes the spider's items to the item pipelines and its requests to the scheduler, and the cycle repeats from step 2 until no pending requests remain in the scheduler, at which point the engine shuts down.

Installing Scrapy

In a terminal, install Scrapy with the command pip install scrapy. When pip reports a successful install, Scrapy is ready to use.
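A quick way to check that the install worked is to print the installed version from Python (or run scrapy version in the terminal):

import scrapy

print(scrapy.__version__)  # e.g. '2.5.0'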

Creating a Scrapy project

In a terminal, switch to the directory where the project should live and create the crawler project with scrapy startproject stockstar.

Then, following the prompt, create a spider from the provided template [command format: scrapy genspider <spider name> <domain>].

Note: the spider name must not be identical to the project name, otherwise genspider refuses with an error along the lines of "Cannot create a spider with the same name as your project".

Then open the newly created Scrapy project in PyCharm.
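For this example, the commands and the resulting project layout look roughly like this (the spider name stock and the domain come from the spider defined later in this article):

scrapy startproject stockstar
cd stockstar
scrapy genspider stock quote.stockstar.com

stockstar/
    scrapy.cfg            # deployment configuration
    stockstar/
        __init__.py
        items.py          # item definitions
        middlewares.py    # spider and downloader middlewares
        pipelines.py      # item pipelines
        settings.py       # project settings
        spiders/
            __init__.py
            stock.py      # the generated spider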

Crawl target

This example scrapes the stock IDs and stock names from the quotes center of a securities website.

Developing the Scrapy crawler

Once the project has been created from the command line, the basic Scrapy crawler scaffolding is already in place; what remains is to fill in the business logic.

Item definition

Define the fields to be scraped, in items.py:

import scrapy


class StockstarItem(scrapy.Item):
    """
    Defines the field names to be scraped
    """
    # define the fields for your item here like:
    # name = scrapy.Field()
    stock_type = scrapy.Field()  # stock type
    stock_id = scrapy.Field()    # stock ID
    stock_name = scrapy.Field()  # stock name

Customizing the spider logic

The structure of a Scrapy spider is fixed: define a class that inherits from scrapy.Spider, set its attributes [spider name, allowed domains, start URLs], and override the parent's parse method. The code inside parse then varies with the logic of the pages being scraped:

import scrapy

from stockstar.items import StockstarItem


class StockSpider(scrapy.Spider):
    name = 'stock'
    allowed_domains = ['quote.stockstar.com']  # allowed domain
    start_urls = ['http://quote.stockstar.com/stock/stock_index.htm']  # start URL

    def parse(self, response):
        """
        Parse the response and yield one item per stock
        :param response:
        :return:
        """
        styles = ['沪a', '沪b', '深a', '深b']  # Shanghai A/B and Shenzhen A/B boards
        for index, style in enumerate(styles):
            print('******************** scraping ' + style + ' stocks ********************')
            ids = response.xpath(
                '//div[@class="w"]/div[@class="main clearfix"]/div[@class="seo_area"]/div['
                '@class="seo_keywordscon"]/ul[@id="index_data_' + str(index) + '"]/li/span/a/text()').getall()
            names = response.xpath(
                '//div[@class="w"]/div[@class="main clearfix"]/div[@class="seo_area"]/div['
                '@class="seo_keywordscon"]/ul[@id="index_data_' + str(index) + '"]/li/a/text()').getall()
            # print('ids = ' + str(ids))
            # print('names = ' + str(names))
            for i in range(len(ids)):
                # create a fresh item per stock so already-yielded items are not mutated
                item = StockstarItem()
                item['stock_type'] = style
                item['stock_id'] = str(ids[i])
                item['stock_name'] = str(names[i])
                yield item
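When working out XPath expressions like the ones above, Scrapy's interactive shell is handy: it downloads the page once and lets you test selectors against the live response. For example, using a shortened, id-based variant of the selector above:

scrapy shell http://quote.stockstar.com/stock/stock_index.htm
>>> response.xpath('//ul[@id="index_data_0"]/li/span/a/text()').getall()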

Data processing

The scraped data is processed in a pipeline. To keep this example simple, the items are just printed to the console:

class StockstarPipeline:
    def process_item(self, item, spider):
        print('stock type>>>>' + item['stock_type'] + ' stock ID>>>>' + item['stock_id'] + ' stock name>>>>' + item['stock_name'])
        return item
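In practice you would normally persist the items rather than print them. As a minimal sketch, not from the original example (the file name stocks.csv is an arbitrary choice), a pipeline that writes each item to a CSV file could look like this:

import csv


class CsvStockPipeline:
    def open_spider(self, spider):
        # called once when the spider starts: open the output file
        self.file = open('stocks.csv', 'w', newline='', encoding='utf-8')
        self.writer = csv.writer(self.file)
        self.writer.writerow(['stock_type', 'stock_id', 'stock_name'])

    def process_item(self, item, spider):
        # write one row per item, then pass the item on unchanged
        self.writer.writerow([item['stock_type'], item['stock_id'], item['stock_name']])
        return item

    def close_spider(self, spider):
        # called once when the spider finishes: close the file
        self.file.close()

To take effect it would also need an entry in ITEM_PIPELINES in settings.py, shown in the next section.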

Note: item fields may only be assigned with item['key'] = value; attribute-style assignment, item.key = value, is not allowed.
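A quick illustration, using a made-up stock code:

item = StockstarItem()
item['stock_id'] = '600000'  # OK: dict-style assignment
item.stock_id = '600000'     # raises AttributeError: use item['stock_id'] instead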

Scrapy configuration

Scrapy is configured through the settings.py file, covering request headers, pipelines, the robots protocol, and more:

# Scrapy settings for stockstar project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
#     https://docs.scrapy.org/en/latest/topics/settings.html
#     https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
#     https://docs.scrapy.org/en/latest/topics/spider-middleware.html

BOT_NAME = 'stockstar'

SPIDER_MODULES = ['stockstar.spiders']
NEWSPIDER_MODULE = 'stockstar.spiders'

# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = 'stockstar (+http://www.yourdomain.com)'

# Obey robots.txt rules (whether to honor the robots protocol)
ROBOTSTXT_OBEY = False

# Configure maximum concurrent requests performed by Scrapy (default: 16)
#CONCURRENT_REQUESTS = 32

# Configure a delay for requests for the same website (default: 0)
# See https://docs.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
#DOWNLOAD_DELAY = 3

# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16

# Disable cookies (enabled by default)
#COOKIES_ENABLED = False

# Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED = False

# Override the default request headers:
DEFAULT_REQUEST_HEADERS = {
  # 'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
  'User-Agent': 'Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.131 Mobile Safari/537.36',
  # 'Accept-Language': 'en,zh-CN,zh;q=0.9'
}

# Enable or disable spider middlewares
# See https://docs.scrapy.org/en/latest/topics/spider-middleware.html
#SPIDER_MIDDLEWARES = {
#    'stockstar.middlewares.StockstarSpiderMiddleware': 543,
#}

# Enable or disable downloader middlewares
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
#DOWNLOADER_MIDDLEWARES = {
#    'stockstar.middlewares.StockstarDownloaderMiddleware': 543,
#}

# Enable or disable extensions
# See https://docs.scrapy.org/en/latest/topics/extensions.html
#EXTENSIONS = {
#    'scrapy.extensions.telnet.TelnetConsole': None,
#}

# Configure item pipelines
# See https://docs.scrapy.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
   'stockstar.pipelines.StockstarPipeline': 300,
}

# Enable and configure the AutoThrottle extension (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/autothrottle.html
#AUTOTHROTTLE_ENABLED = True
# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False

# Enable and configure HTTP caching (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED = True
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = 'httpcache'
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'
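If the CSV pipeline sketched earlier were added to the project, both pipelines could be enabled side by side; the lower number runs first:

ITEM_PIPELINES = {
   'stockstar.pipelines.StockstarPipeline': 300,
   'stockstar.pipelines.CsvStockPipeline': 400,
}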

Running Scrapy

Because a Scrapy project is a set of separate modules rather than a single runnable script, it is normally launched from the terminal, in the format scrapy crawl <spider name>:

scrapy crawl stock

The spider then prints each scraped stock's type, code, and name to the console.
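If you would rather start the crawl from a Python script (e.g. to run or debug it inside an IDE), a minimal sketch using Scrapy's CrawlerProcess looks like this; the file name run.py is an arbitrary choice:

# run.py, placed next to scrapy.cfg in the project root
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

process = CrawlerProcess(get_project_settings())  # load settings.py
process.crawl('stock')  # the spider name defined in StockSpider.name
process.start()  # blocks until the crawl finishes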

Remarks

This example is deliberately simple, meant only to illustrate common Scrapy usage: all of the scraped content is present in the HTML source returned by the first request, i.e. what you see is what you get.

Example source code
