arjun-go-go/scrapy_splash_tianyancha

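scrapy_splash Tianyancha crawler

Splash is a JavaScript rendering service: a lightweight browser that exposes an HTTP API. It is implemented in Python on top of Twisted and QT. The Twisted QT reactor makes the service asynchronous, which lets it take advantage of WebKit's concurrency.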

Install Splash with Docker

Pull the image from Docker Hub:

sudo docker pull scrapinghub/splash

Start the Splash service

Start the Splash service with docker run, publishing port 8050:

sudo docker run -p 8050:8050 scrapinghub/splash
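Once the container is running, one way to confirm that Splash is reachable is to ask its render.html endpoint for a rendered page. The sketch below is not part of this repository; the helper names `splash_render_url` and `fetch_rendered_html` are hypothetical, and the default address assumes the port mapping from the command above.

```python
from urllib.parse import urlencode
from urllib.request import urlopen


def splash_render_url(page_url, splash_url="http://localhost:8050", wait=0.5):
    """Build a Splash render.html URL for page_url (hypothetical helper).

    `wait` is the number of seconds Splash waits after the page loads
    before returning the rendered HTML.
    """
    query = urlencode({"url": page_url, "wait": wait})
    return f"{splash_url.rstrip('/')}/render.html?{query}"


def fetch_rendered_html(page_url, splash_url="http://localhost:8050", wait=0.5):
    """Fetch the JavaScript-rendered HTML of page_url through Splash.

    Requires the Splash container to be running and reachable at splash_url.
    """
    with urlopen(splash_render_url(page_url, splash_url, wait), timeout=30) as resp:
        return resp.read().decode("utf-8", errors="replace")
```

Calling `fetch_rendered_html("https://example.com")` should return the page's HTML if the service is up; a connection error usually means the container is not running or the port is not published.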

Configure Splash (all of the following goes in settings.py):

1) Add the Splash server address:

SPLASH_URL = 'http://localhost:8050'

2) Add the Splash middlewares to DOWNLOADER_MIDDLEWARES:

DOWNLOADER_MIDDLEWARES = {
    'scrapy_splash.SplashCookiesMiddleware': 723,
    'scrapy_splash.SplashMiddleware': 725,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
}

3)Enable SplashDeduplicateArgsMiddleware:

SPIDER_MIDDLEWARES = {
    'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
}

4)Set a custom DUPEFILTER_CLASS:

DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'

5) If you use Scrapy's HTTP cache, set a custom cache storage backend:

HTTPCACHE_STORAGE = 'scrapy_splash.SplashAwareFSCacheStorage'
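It is easy to miss one of the five settings above. As a rough sanity check, a small helper could compare a settings mapping against the values listed in this README. This is only a sketch: `missing_splash_settings` and the two constants are hypothetical names, not part of scrapy_splash, and the check assumes the exact middleware priorities shown above.

```python
# Settings that only need to be present (their values may differ per setup).
REQUIRED_KEYS = ("SPLASH_URL", "DUPEFILTER_CLASS", "HTTPCACHE_STORAGE")

# Middleware entries and the priorities used in this README.
REQUIRED_MIDDLEWARES = {
    "DOWNLOADER_MIDDLEWARES": {
        "scrapy_splash.SplashCookiesMiddleware": 723,
        "scrapy_splash.SplashMiddleware": 725,
        "scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware": 810,
    },
    "SPIDER_MIDDLEWARES": {
        "scrapy_splash.SplashDeduplicateArgsMiddleware": 100,
    },
}


def missing_splash_settings(settings):
    """Return the names of Splash-related settings that are absent or differ."""
    problems = [key for key in REQUIRED_KEYS if key not in settings]
    for group, entries in REQUIRED_MIDDLEWARES.items():
        configured = settings.get(group) or {}
        for path, priority in entries.items():
            if configured.get(path) != priority:
                problems.append(f"{group}[{path!r}]")
    return problems
```

Passing a plain dict built from settings.py, an empty result suggests all five steps were applied; any returned names point at the step that still needs attention.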
