First Experience with Scrapy

#Python 2021/04/25 18:40:51
Contents
  1. Setting up a Python virtual environment
  2. Starting a Scrapy project

Scrapy is an application framework for crawling web sites and extracting structured data which can be used for a wide range of useful applications, like data mining, information processing or historical archival.

Setting up a Python virtual environment

To avoid installing third-party dependencies globally, where different projects end up conflicting with each other, Python provides virtual environments so that each application can keep its own set of dependencies.

The solution for this problem is to create a virtual environment, a self-contained directory tree that contains a Python installation for a particular version of Python, plus a number of additional packages.

python -m venv virtual-env

This command creates a folder named virtual-env containing a Python interpreter, the standard library, and a few supporting files; the Scripts folder (bin on Unix and macOS) holds the scripts that activate the virtual environment.

  • activate.bat — in Windows cmd, run virtual-env\Scripts\activate.bat
  • Activate.ps1 — in Windows PowerShell, run virtual-env\Scripts\Activate.ps1
  • activate — on Unix and macOS, run source virtual-env/bin/activate
  • deactivate.bat — run deactivate to leave the virtual environment

After this, Scrapy and any other packages you install will go into the virtual environment.
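With the environment activated, install Scrapy into it (the standard pip command; pin a version if you need reproducible builds):

pip install scrapy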

Starting a Scrapy project
scrapy startproject tutorial test

This creates a Scrapy project named tutorial inside the test folder, which then contains the following:

    scrapy.cfg            # deploy configuration file
    tutorial/             # project's Python module, you'll import your code from here
        __init__.py
        items.py          # project items definition file (see the sketch below the tree)
        middlewares.py    # project middlewares file
        pipelines.py      # project pipelines file
        settings.py       # project settings file
        spiders/          # a directory where you'll later put your spiders
            __init__.py

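The items.py file in the tree above is where item classes can be declared if you want structured items; a minimal sketch (the spider below simply yields plain dicts, so this file can stay untouched):

#tutorial/items.py
import scrapy

class DoubanItem(scrapy.Item):
    # declared field for one scraped movie entry
    title = scrapy.Field()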
Write your own spider files under the spiders directory; in settings.py you can configure things such as the default request headers:

#spiders/douban250_spider.py
import scrapy


class DoubanSpider(scrapy.Spider):
    name = 'douban'
    start_urls = ['https://movie.douban.com/top250']

    def parse(self, response):
        # Each div.hd block on the list page holds one movie entry.
        for movie in response.css('div.hd'):
            yield {
                'title': movie.css('span.title::text').get()
            }
        # Follow the "next page" link and parse it with this same callback.
        yield from response.follow_all(css='span.next a', callback=self.parse)

#settings.py
DEFAULT_REQUEST_HEADERS = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
    'Accept-Encoding': 'gzip, deflate',
    'Accept-Language': 'zh-CN,zh;q=0.9,en-US;q=0.8,en;q=0.7',
    'Content-Type': 'text/html; charset=utf-8',
    'Cache-Control': 'max-age=0',
    'Connection': 'keep-alive',
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/89.0.4389.128 Safari/537.36'
}
FEED_EXPORT_ENCODING = "UTF-8"  # encoding used when exporting feeds
Finally, run the spider from the project root and try out requests from the command line:

scrapy crawl douban -O douban-250.json  # run the spider and export the results to a JSON file (-O overwrites)
scrapy shell "https://movie.douban.com/top250"  # fetch the page and inspect it interactively
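Inside the shell you can try the same selectors the spider uses before putting them into code; for example (a rough sketch, the actual output depends on the live page):

>>> response.css('div.hd span.title::text').get()   # first movie title on the page
>>> response.css('span.next a::attr(href)').get()   # relative URL of the next page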

References:

Scrapy documentation