This repository contains a Python script that automates downloading data from the WHO Global Tuberculosis Programme website and performs basic data processing on the downloaded data.
The script utilizes Selenium, a web automation tool, to navigate to the WHO Global Tuberculosis Programme data page and download the data in CSV format. It then processes the downloaded CSV file using pandas, a powerful data manipulation library in Python, to perform the following data processing techniques:
- Handling Missing Values: Drops rows with any missing values.
- Data Transformation: Converts string columns to lowercase.
- Data Aggregation: Groups data by country and calculates the mean of numeric columns.
- Data Filtering: Filters rows based on a condition.
The processed data is then saved as separate CSV files in the specified output directory.
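The four processing steps can be sketched as follows. The column names (`country`, `cases`, `year`), the sample values, and the filter threshold are placeholders for illustration — the real downloaded CSV will have its own schema:

```python
import pandas as pd

# Tiny stand-in for the downloaded CSV (hypothetical columns/values).
df = pd.DataFrame({
    "country": ["India", "Kenya", None, "India"],
    "cases": [2_800_000, 140_000, 50_000, None],
    "year": [2021, 2021, 2021, 2020],
})

# 1. Handling missing values: drop rows with any NaN.
df = df.dropna()

# 2. Data transformation: lowercase every string column.
str_cols = df.select_dtypes(include="object").columns
df[str_cols] = df[str_cols].apply(lambda s: s.str.lower())

# 3. Data aggregation: mean of numeric columns per country.
agg = df.groupby("country", as_index=False).mean(numeric_only=True)

# 4. Data filtering: keep rows matching a condition (threshold is arbitrary).
filtered = df[df["cases"] > 1_000_000]

# Each result is written out as its own CSV file.
agg.to_csv("aggregated.csv", index=False)
filtered.to_csv("filtered.csv", index=False)
```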
To run the script, you need the following:
- Python 3.x installed on your system.
- The required Python packages installed: `selenium` and `pandas`.
- A WebDriver installed and its path configured in the script.
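For reference, pointing Selenium at a WebDriver and a download directory typically looks like the sketch below. It assumes Chrome/ChromeDriver, and both paths are placeholders to replace with your own:

```python
from selenium import webdriver
from selenium.webdriver.chrome.service import Service

# Placeholder paths -- adjust to your system.
CHROMEDRIVER_PATH = "/path/to/chromedriver"
DOWNLOAD_DIR = "/path/to/output"

# Tell Chrome to save downloaded files into DOWNLOAD_DIR.
options = webdriver.ChromeOptions()
options.add_experimental_option(
    "prefs", {"download.default_directory": DOWNLOAD_DIR}
)

# Launch the browser using the configured driver binary.
driver = webdriver.Chrome(service=Service(CHROMEDRIVER_PATH), options=options)
```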
- Clone the repository to your local machine.
- Install the required Python packages using pip: `pip install -r requirements.txt`
- Configure the path to the WebDriver in the script.
- Run the script: `python web_scraping.py`
- The processed data will be saved as CSV files in the specified output directory.
Feel free to contribute to this project by opening issues or pull requests!
I'm a full stack Web & App Developer and an undergrad Data Science Student 👨💻🙌