This repository has been archived by the owner on Sep 2, 2024. It is now read-only.

NewsScraper-Django: A powerful news scraping solution using BeautifulSoup and Selenium, seamlessly integrated with Django. Effortlessly extract articles, handle JavaScript-rendered content, and present results through a user-friendly web interface

ThisIs-Developer/News-Scraping-using-BeautyfulSoup-Selenium-with-Django

News-Scraping-using-BeautifulSoup-Selenium-with-Django

Table of Contents

  1. About The Project
  2. Features
  3. Getting Started
  4. Usage
  5. Changelog
  6. License
  7. Acknowledgements

About The Project

This project automates the process of scraping news articles from various sources using BeautifulSoup and Selenium, integrated into a Django application. It supports multiple websites and can run scraping tasks concurrently using threading. The data is stored in an Excel file and optionally in a MySQL database.

Features

  • Scrapes news articles from Hindustan Times, Hindustan Times Bangla, Zee News, TV9 Bangla, and Ananda Bazar.
  • Concurrent scraping using threading with a delay between iterations.
  • Supports both on-demand scraping and scheduled scraping tasks.
  • Saves scraped data to Excel files and, optionally, to a MySQL database.
  • Creates a new folder for data storage based on the current date.
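The threading and dated-folder features can be pictured with a minimal sketch like the one below. It is not the project's actual code: the function names (make_output_dir, run_scrapers), the per-site scraper callables, and the ISO date folder name are all assumptions made for illustration.

```python
import threading
import time
from datetime import date
from pathlib import Path


def make_output_dir(base):
    """Create (if needed) a folder named after today's date, e.g. 2024-09-02."""
    out_dir = Path(base) / date.today().isoformat()  # folder naming is an assumption
    out_dir.mkdir(parents=True, exist_ok=True)
    return out_dir


def run_scrapers(scrapers, iterations=1, delay_seconds=0.0):
    """Run each scraper in its own thread, pausing between iterations.

    `scrapers` maps a site name to a zero-argument callable returning a list.
    """
    results = {}
    lock = threading.Lock()

    def worker(name, func):
        items = func()
        with lock:  # guard the shared results dict
            results.setdefault(name, []).extend(items)

    for i in range(iterations):
        threads = [
            threading.Thread(target=worker, args=(name, func))
            for name, func in scrapers.items()
        ]
        for thread in threads:
            thread.start()
        for thread in threads:
            thread.join()  # wait for every site before the next round
        if i < iterations - 1:
            time.sleep(delay_seconds)  # delay between iterations
    return results


if __name__ == "__main__":
    # Hypothetical stand-ins for the real per-site scrapers.
    demo = {
        "site_a": lambda: [{"title": "Example headline A"}],
        "site_b": lambda: [{"title": "Example headline B"}],
    }
    print(make_output_dir("data"))
    print(run_scrapers(demo, iterations=2, delay_seconds=1.0))
```

Joining all threads before sleeping keeps each iteration self-contained, so a slow site delays the next round rather than overlapping with it.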

Getting Started

To get a local copy up and running, follow these steps.

Prerequisites

  • Python 3.6+
  • Django 3.0+
  • Selenium
  • BeautifulSoup
  • MySQL (for database storage)

Installation

  1. Clone the repository:
    git clone https://github.com/ThisIs-Developer/News-Scraping-using-BeautifulSoup-Selenium-with-Django.git
  2. Navigate to the project directory:
    cd News-Scraping-using-BeautifulSoup-Selenium-with-Django
  3. Install required Python packages:
    pip install -r requirements.txt
  4. Set up the Django project:
    python manage.py migrate
    python manage.py createsuperuser
  5. Update the database configuration in settings.py if using MySQL.
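For step 5, a MySQL entry in settings.py might look roughly like the sketch below. The database name, user, and the NEWS_DB_* environment variable names are placeholders chosen for this example, not values taken from the project. Django's MySQL backend also needs a database driver such as mysqlclient; whether requirements.txt already installs one is not stated here.

```python
import os

# Sketch of a MySQL configuration for settings.py.
# All NEWS_DB_* variable names and default values are hypothetical.
DATABASES = {
    "default": {
        "ENGINE": "django.db.backends.mysql",
        "NAME": os.environ.get("NEWS_DB_NAME", "news_scraper"),
        "USER": os.environ.get("NEWS_DB_USER", "root"),
        "PASSWORD": os.environ.get("NEWS_DB_PASSWORD", ""),
        "HOST": os.environ.get("NEWS_DB_HOST", "127.0.0.1"),
        "PORT": os.environ.get("NEWS_DB_PORT", "3306"),
    }
}
```

Reading credentials from the environment keeps passwords out of version control. After changing the configuration, run python manage.py migrate again so the tables are created in MySQL.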

Getting Started with Selenium

What is Web Scraping?

Web scraping is a technique for extracting information from the internet automatically using software that simulates human web surfing.
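As a small illustration of the idea, the sketch below parses a hard-coded HTML snippet with BeautifulSoup and pulls out headline text and links. The markup, the CSS selector, and the extract_headlines helper are invented for this example; real news sites use their own structure, so the selector would need to be adapted per site.

```python
from bs4 import BeautifulSoup

# Made-up markup standing in for a downloaded news page.
SAMPLE_HTML = """
<html><body>
  <div class="story">
    <h2><a href="/news/first-story">First headline</a></h2>
  </div>
  <div class="story">
    <h2><a href="/news/second-story">Second headline</a></h2>
  </div>
</body></html>
"""


def extract_headlines(html):
    """Return a list of {"title", "url"} dicts, one per story link."""
    soup = BeautifulSoup(html, "html.parser")
    articles = []
    # The selector below matches the sample markup only (an assumption).
    for link in soup.select("div.story h2 a"):
        articles.append(
            {"title": link.get_text(strip=True), "url": link.get("href")}
        )
    return articles


if __name__ == "__main__":
    for article in extract_headlines(SAMPLE_HTML):
        print(article["title"], article["url"])
```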

What is Selenium?

Selenium is a free, open-source framework for automated testing of web applications across different browsers and platforms. Because it drives a real browser, it can also automate other tasks, such as web scraping.

Installing Selenium

To install Selenium:

pip install selenium # use pip3 if plain pip is not your Python 3 installer

Installing Webdrivers

Selenium needs a driver to interface with the chosen browser. Firefox, for example, requires geckodriver, which must be installed before the examples below can run. The driver executable must be on your PATH; for example, place it in /usr/bin or /usr/local/bin.

Other supported browsers have their own drivers.

This project uses ChromeDriver, the WebDriver for Chrome. There are multiple ways to install it:

  1. Using webdriver-manager (recommended)

    • Install package:
      pip install webdriver-manager # use pip3 if plain pip is not your Python 3 installer
    • Load package:
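      from selenium import webdriver
      from selenium.webdriver.chrome.service import Service
      from webdriver_manager.chrome import ChromeDriverManager
      
      # Selenium 4 expects the driver path wrapped in a Service object.
      driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()))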
  2. Manual download from Chrome's website

    • Load package:
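      from selenium import webdriver
      from selenium.webdriver.chrome.service import Service
      
      # Selenium 4 expects the driver path wrapped in a Service object.
      driver = webdriver.Chrome(service=Service(executable_path='/path/to/chromedriver'))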
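Once a driver is available, fetching a JavaScript-rendered page could look like the sketch below. It is an illustration rather than the project's own scraper: the fetch_rendered_html name, the 10-second timeout, and the example URL are assumptions. It creates Chrome without an explicit driver path, which relies on chromedriver being on your PATH or on a Selenium release that can locate the driver itself (4.6 or later).

```python
def fetch_rendered_html(url, timeout=10):
    """Load `url` in headless Chrome and return the HTML after scripts have run."""
    # Imported here so this module can be imported even without Selenium installed.
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    options = webdriver.ChromeOptions()
    # Run without a visible window; this flag needs a recent Chrome version.
    options.add_argument("--headless=new")

    driver = webdriver.Chrome(options=options)
    try:
        driver.get(url)
        # Wait until the <body> element exists before reading the page.
        WebDriverWait(driver, timeout).until(
            EC.presence_of_element_located((By.TAG_NAME, "body"))
        )
        return driver.page_source
    finally:
        driver.quit()  # always release the browser process


if __name__ == "__main__":
    # Placeholder URL; substitute the news page you want to scrape.
    html = fetch_rendered_html("https://example.com")
    print(len(html))
```

The returned HTML string can then be handed to BeautifulSoup for parsing. Waiting on a site-specific element instead of body is usually more reliable for pages that load articles asynchronously.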

Usage

Run the Django development server:

python manage.py runserver

Navigate to the admin panel, configure the scraping tasks, and start the scraping process. The scraped data will be saved in the specified formats and locations.
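To give a sense of how appending to an existing Excel file can work, here is a minimal sketch using openpyxl. Which Excel library the project actually uses is not stated in this README, so the choice of openpyxl, the append_rows helper, and the column names are all assumptions.

```python
from pathlib import Path

from openpyxl import Workbook, load_workbook

# Hypothetical column layout for the spreadsheet.
HEADER = ["title", "url", "source"]


def append_rows(xlsx_path, rows):
    """Append rows to the workbook at xlsx_path, creating it if missing.

    Returns the number of data rows in the sheet, excluding the header.
    """
    path = Path(xlsx_path)
    if path.exists():
        workbook = load_workbook(str(path))
        sheet = workbook.active
    else:
        workbook = Workbook()
        sheet = workbook.active
        sheet.append(HEADER)  # write the header once, on first creation
    for row in rows:
        sheet.append(row)
    workbook.save(str(path))
    return sheet.max_row - 1


if __name__ == "__main__":
    count = append_rows(
        "scraped_data.xlsx",
        [["Example headline", "https://example.com/story", "example"]],
    )
    print(f"{count} rows stored")
```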

Changelog

v1.0.1

  • Initial release with scraping from Hindustan Times.

v1.0.1.1

  • Added scraping from Hindustan Times Bangla.

v1.0.1.2

  • Added scraping from Zee News.

v1.0.1.3

  • Added scraping from TV9 Bangla.

v1.0.1.4

  • Added scraping from Ananda Bazar.

v2.0.1

  • Dynamic scraping based on request value.

v2.0.1.1

  • Appending data to existing scraped_data.xlsx.

v3.0.1

  • Concurrent scraping with threading.

v3.0.1.1

  • Automatic folder creation based on current date.

v4.0.1

  • Integration with MySQL database and updated Django models.

License

Distributed under the Apache License 2.0. See LICENSE for more information.

Acknowledgements
