Postgres to MongoDB Data Migration Project

This project retrieves data from the MedQ Database (Postgres), transforms it, and loads it into MongoDB, with the ETL pipeline orchestrated by Apache Airflow for efficient data processing and storage.

Project Overview

The main goal of this project is to retrieve data from the MedQ Database, apply the necessary transformations, and load the data into MongoDB. The ETL (Extract, Transform, Load) process is orchestrated with Apache Airflow to ensure smooth and efficient data handling.

Technologies Used

  • Apache Airflow: Orchestrates the ETL process.
  • Python: For scripting and data transformation.
  • Pandas: For data manipulation and transformation.
  • Docker Compose: To manage the Airflow and PostgreSQL services.
  • MedQ Database (Postgres): Source of the data.

Setup and Installation

Prerequisites

Ensure you have the following installed on your system:

  • Docker
  • Docker Compose

Steps

  1. Clone the repository

    git clone https://github.com/MekWiset/PostgresToMongoDB_migration_project.git
    cd PostgresToMongoDB_migration_project
  2. Create an airflow.env file from the example and configure your environment variables

    cp airflow.env.example airflow.env
  3. Edit the airflow.env file and fill in the necessary values

    nano airflow.env
  4. Create a .env file from the example and fill in your sensitive configuration values

    cp env.example .env
    nano .env
  5. Build and start the services using Docker Compose

    docker-compose up -d
  6. Run the postgres_to_mongodb pipeline

    airflow trigger_dag postgres_to_mongodb

Usage

To run the ETL pipeline through the Airflow UI, follow these steps:

  1. Access the Airflow UI at http://localhost:8080 and trigger the DAG for the ETL process.
  2. Monitor the DAG execution and check logs for any issues.
  3. Verify the transformed data in the data/ directory and in the target MongoDB collection.

Instructions

  1. Extract Data:

    • Install the required dependencies:
      pip install -r requirements.txt
    • Set up your Postgres database connection in the environment variables or configuration files.
    • The data extraction is handled by the postgres_extractor.py script in the plugins/extract directory, which queries the Postgres database and stores the result in the data/medq_data.csv file (a hedged extraction sketch follows this list).
  2. Transform Data:

    • Data transformation is performed by the data_transformer.py script in the plugins/transform directory, which processes the raw data and saves the result to data/medq_data_transformed.csv (see the transformation sketch after this list).
  3. Load Data:

    • The mongo_loader.py script in the plugins/load directory loads the transformed data into MongoDB (see the loading sketch after this list).
  4. Run the DAG:

    • Ensure the DAG defined in dags/pg_to_mongo_dag.py is scheduled and triggered as required (see the DAG sketch after this list):
      airflow trigger_dag postgres_to_mongodb
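
The hedged sketches below illustrate each stage of the pipeline; none of them are the repository's actual code. First, the extraction step in plugins/extract/postgres_extractor.py could look like the following, assuming a SQLAlchemy engine built from environment variables. The variable names, table name, and function name are illustrative assumptions.

    # Hypothetical sketch of plugins/extract/postgres_extractor.py: pull rows from
    # Postgres and persist them as data/medq_data.csv. Environment variable names,
    # the table name, and the function name are assumptions for illustration.
    import os

    import pandas as pd
    from sqlalchemy import create_engine


    def extract_to_csv(query: str, output_path: str = "data/medq_data.csv") -> None:
        """Run a query against Postgres and write the result set to a CSV file."""
        uri = (
            f"postgresql+psycopg2://{os.environ['POSTGRES_USER']}:"
            f"{os.environ['POSTGRES_PASSWORD']}@{os.environ['POSTGRES_HOST']}:"
            f"{os.environ.get('POSTGRES_PORT', '5432')}/{os.environ['POSTGRES_DB']}"
        )
        engine = create_engine(uri)
        df = pd.read_sql(query, engine)      # fetch the result set into a DataFrame
        df.to_csv(output_path, index=False)  # persist the raw extract


    if __name__ == "__main__":
        extract_to_csv("SELECT * FROM medq_records")  # table name is illustrative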
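
The transformation step can be pictured as a CSV-in, CSV-out pandas function. The cleanup rules in this sketch are placeholders; the project-specific logic lives in plugins/transform/data_transformer.py.

    # Hypothetical sketch of plugins/transform/data_transformer.py. The actual
    # transformation rules are project-specific; only the overall shape is shown.
    import pandas as pd


    def transform(
        input_path: str = "data/medq_data.csv",
        output_path: str = "data/medq_data_transformed.csv",
    ) -> None:
        df = pd.read_csv(input_path)

        # Illustrative cleanup steps (assumed, not the repository's actual rules):
        df = df.drop_duplicates()                             # remove exact duplicates
        df.columns = [c.strip().lower() for c in df.columns]  # normalize column names
        df = df.dropna(how="all")                             # drop fully empty rows

        df.to_csv(output_path, index=False)


    if __name__ == "__main__":
        transform()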
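
Loading can be done with pymongo by turning each CSV row into a document. The MONGO_URI variable and the database and collection names below are assumptions used only for illustration.

    # Hypothetical sketch of plugins/load/mongo_loader.py: push the transformed CSV
    # into MongoDB. Connection and naming details are assumed, not taken from the repo.
    import os

    import pandas as pd
    from pymongo import MongoClient


    def load_to_mongodb(
        input_path: str = "data/medq_data_transformed.csv",
        database: str = "medq",
        collection: str = "records",
    ) -> None:
        df = pd.read_csv(input_path)
        records = df.to_dict(orient="records")  # one document per CSV row

        client = MongoClient(os.environ["MONGO_URI"])
        try:
            if records:  # insert_many rejects an empty list
                client[database][collection].insert_many(records)
        finally:
            client.close()


    if __name__ == "__main__":
        load_to_mongodb()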
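
Putting the steps together, a typical layout for dags/pg_to_mongo_dag.py chains one task per stage. This sketch assumes Airflow 2.x-style imports and an illustrative daily schedule; the module paths for the callables are also assumptions.

    # Hypothetical sketch of dags/pg_to_mongo_dag.py: extract -> transform -> load.
    from datetime import datetime

    from airflow import DAG
    from airflow.operators.python import PythonOperator

    # Illustrative imports; the real module paths and function names may differ.
    from extract.postgres_extractor import extract_to_csv
    from load.mongo_loader import load_to_mongodb
    from transform.data_transformer import transform

    with DAG(
        dag_id="postgres_to_mongodb",
        start_date=datetime(2024, 1, 1),
        schedule_interval="@daily",  # assumed schedule
        catchup=False,
    ) as dag:
        extract_task = PythonOperator(
            task_id="extract_from_postgres",
            python_callable=extract_to_csv,
            op_kwargs={"query": "SELECT * FROM medq_records"},  # illustrative query
        )
        transform_task = PythonOperator(
            task_id="transform_data",
            python_callable=transform,
        )
        load_task = PythonOperator(
            task_id="load_to_mongodb",
            python_callable=load_to_mongodb,
        )

        extract_task >> transform_task >> load_task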

Project Structure

  • Dockerfile: Dockerfile for setting up the project environment.
  • README.md: Documentation for the project.
  • airflow.env.example: Template for Airflow environment variables.
  • env.example: Template for sensitive environment variables.
  • dags/: Directory containing DAGs for Apache Airflow.
    • pg_to_mongo_dag.py: Main DAG for the ETL process.
    • helpers/: Directory for helper modules.
      • sql_query.py: Helper functions for SQL queries.
  • data/: Directory for storing dataset files.
    • medq_data.csv: Extracted data file.
    • medq_data_transformed.csv: Transformed data output file.
    • ref_hospital.xlsx: Reference hospital data file.
  • docker-compose.yaml: Docker Compose configuration file for orchestrating multi-container Docker applications.
  • plugins/: Directory for custom plugins.
    • extract/: Directory for data extraction plugins.
    • load/: Directory for data loading plugins.
    • transform/: Directory for data transformation plugins.
  • requirements.txt: Python dependencies file.

Features

  • Data Extraction: Retrieves data from MedQ Database.
  • Data Transformation: Processes and transforms the data using Pandas.
  • Data Loading: Exports transformed data to MongoDB.
  • Incremental Data Updates: Efficiently handles newly added data (one possible approach is sketched below).
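
One common way to handle incremental updates is a timestamp watermark: remember the newest record timestamp seen so far and only extract rows added after it on the next run. The sketch below illustrates that pattern under stated assumptions (a created_at column, a local watermark file); it is not necessarily how the repository implements it.

    # Watermark-based incremental extraction (an assumed pattern, for illustration).
    import os

    import pandas as pd
    from sqlalchemy import create_engine, text

    WATERMARK_FILE = "data/last_extracted_at.txt"


    def extract_new_rows(uri: str, output_path: str = "data/medq_data.csv") -> None:
        engine = create_engine(uri)

        # Read the previous watermark, falling back to the epoch on the first run.
        if os.path.exists(WATERMARK_FILE):
            with open(WATERMARK_FILE) as f:
                last_seen = f.read().strip()
        else:
            last_seen = "1970-01-01 00:00:00"

        # Fetch only rows created after the watermark (column name is assumed).
        query = text("SELECT * FROM medq_records WHERE created_at > :last_seen")
        df = pd.read_sql(query, engine, params={"last_seen": last_seen})

        if not df.empty:
            # Append the new rows and advance the watermark for the next run.
            df.to_csv(output_path, mode="a", header=not os.path.exists(output_path), index=False)
            with open(WATERMARK_FILE, "w") as f:
                f.write(str(df["created_at"].max()))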

Contributing

Contributions are welcome! To contribute:

  1. Fork the repository.
  2. Create a new branch (git checkout -b feature-branch).
  3. Make your changes and commit them (git commit -m 'Add new feature').
  4. Push to the branch (git push origin feature-branch).
  5. Open a pull request.

Please ensure your code follows the project's coding standards and includes relevant tests.

License

This project is licensed under the MIT License. See the LICENSE file for details.
