Scout

Scout is a data aggregation framework for discovering and collecting scientific data.

The framework contains 3 modules:

  • scider - a scientific data spider
  • sanitizer - cleans the aggregated data so it can be used for text mining and further processing
  • db - database module that stores the data in a database (currently supports MongoDB only)
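As a rough illustration of what the sanitizer stage might do, the sketch below strips markup and normalises whitespace in a scraped string using only the standard library. The function name `sanitize_text` is hypothetical and is not part of Scout's API; the real sanitizer module may work quite differently.

```python
import html
import re
import unicodedata

_TAG_RE = re.compile(r"<[^>]+>")   # naive HTML/XML tag matcher
_WS_RE = re.compile(r"\s+")        # any run of whitespace


def sanitize_text(raw: str) -> str:
    """Hypothetical helper (not Scout's API): clean scraped text for text mining."""
    # Remove tags first so that escaped entities such as &lt;b&gt; are kept as text.
    no_tags = _TAG_RE.sub(" ", raw)
    # Turn entities such as &amp; and &nbsp; into real characters.
    unescaped = html.unescape(no_tags)
    # NFKC folds compatibility characters (e.g. non-breaking space) into plain ones.
    normalised = unicodedata.normalize("NFKC", unescaped)
    # Collapse whitespace runs and trim both ends.
    return _WS_RE.sub(" ", normalised).strip()


print(sanitize_text("<p>Deep&nbsp;learning   <b>survey</b></p>"))
```

A regex-based tag stripper is enough for a sketch; heavily nested or malformed HTML would usually call for a real parser.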

How to install

# Install the Scout development version (there is no stable release yet)
pip install -e git+https://github.com/invaana/scout.git#egg=scout

How to use

Step 1: Create a scider input JSON file
# example: examples/configs/github.json

from scout.scider.tasks import scrape_website_task
from scout.scider import helpers

config_file = "configs/github.json"
config = helpers.read_json_file(config_file)

scrape_website_task(config=config, max_limit=30, save=True) 

:param config: the scider configuration as a dict (for example, loaded from the JSON file)
:param max_limit: maximum number of entries to scrape before the scraper halts
:param save: whether the scraped data should be saved to the database
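To make the `max_limit` and `save` semantics concrete, here is a minimal sketch of a loop that halts after a fixed number of entries and optionally persists each one. The names `scrape_entries` and `store` are hypothetical stand-ins; this is not how `scrape_website_task` is implemented, and a plain list takes the place of the database.

```python
from itertools import islice
from typing import Iterable, List


def scrape_entries(entries: Iterable[dict], max_limit: int,
                   save: bool, store: List[dict]) -> List[dict]:
    """Hypothetical sketch: consume at most max_limit entries, saving if asked."""
    collected = []
    # islice stops pulling from the source once max_limit entries were taken.
    for entry in islice(entries, max_limit):
        collected.append(entry)
        if save:
            store.append(entry)  # stand-in for a database insert
    return collected


# A generator stands in for the pages a spider would yield.
fake_pages = ({"id": i} for i in range(100))
db_stub: List[dict] = []
scraped = scrape_entries(fake_pages, max_limit=30, save=True, store=db_stub)
print(len(scraped), len(db_stub))
```

Because `islice` is lazy, entries beyond the limit are never requested from the source, which is the behaviour one would want from a scraper that should halt early.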

To run the job in a queue instead, call scrape_website_task.delay(config=config, max_limit=30, save=True)
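The `.delay(...)` call hands the job to a queue and returns immediately instead of blocking until scraping finishes. Which task queue Scout uses is not stated here, so the sketch below only illustrates the idea with the standard library: the class `TinyTaskQueue` and its methods are hypothetical and are not Scout's API.

```python
import queue
import threading


class TinyTaskQueue:
    """Hypothetical in-process queue: jobs run on one background worker thread."""

    def __init__(self):
        self._jobs = queue.Queue()
        self.results = []
        worker = threading.Thread(target=self._run, daemon=True)
        worker.start()

    def delay(self, func, **kwargs):
        # Enqueue the job and return at once; the worker picks it up later.
        self._jobs.put((func, kwargs))

    def _run(self):
        while True:
            func, kwargs = self._jobs.get()
            try:
                self.results.append(func(**kwargs))
            except Exception as exc:  # keep the worker alive if a job fails
                self.results.append(exc)
            finally:
                self._jobs.task_done()

    def join(self):
        # Block until every queued job has been processed.
        self._jobs.join()


def fake_scrape(max_limit, save):
    # Stand-in for a real scraping job.
    return {"scraped": max_limit, "saved": save}


tasks = TinyTaskQueue()
tasks.delay(fake_scrape, max_limit=30, save=True)
tasks.join()
print(tasks.results)
```

A production task queue would normally run workers in separate processes backed by a message broker; the single thread here is only meant to show the enqueue-then-return pattern.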

This module was designed by the Data Science team at Invaana for internal use. If you are a scientific data enthusiast, we'd love to hear about your interests. Let us know @invaana!