GitHub - despinoza119/Entity_Resolution_and_Clustering: As a data scientist at Retailer A, I'm tasked with performing entity resolution between Retailer A and Retailer B's product descriptions datasets to identify overlapping products.

Record Linkage

Project Overview

This project, developed for the Data Driven Business class at Barcelona Technology School, identifies the products that two retailers have in common so that the marketing team can build more personalized product-offering campaigns and product indexes.

Context

Suppose you are a data scientist at Retailer A, a large department store chain. Retailer A has recently entered into a strategic partnership with Retailer B, an online e-commerce platform specializing in products. As part of this partnership, Retailer B has shared its product descriptions dataset with Retailer A for cross-promotion, product indexing, and targeted marketing.

Your task is to perform entity resolution, also known as record linkage, on these datasets. The goal is to identify which products in Retailer B's dataset also appear in Retailer A's dataset. This will allow the marketing department to create more personalized product-offering campaigns and product indexes.
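To make the task concrete, here is a minimal sketch of record linkage on product descriptions. It is an illustration only and not the method used in the notebooks: it swaps in a simple token-set Jaccard similarity, and the function names, the threshold of 0.5, and the sample rows are all hypothetical.

```python
import re


def tokens(text):
    """Lowercase a description and split it into alphanumeric tokens (digits are kept)."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def jaccard(a, b):
    """Jaccard similarity of two token sets: size of intersection over size of union."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def link_records(products_a, products_b, threshold=0.5):
    """Map each Retailer B description to its best Retailer A match, if the score reaches the threshold."""
    if not products_a:
        return {}
    tokens_a = [tokens(p) for p in products_a]
    links = {}
    for desc_b in products_b:
        tok_b = tokens(desc_b)
        scores = [jaccard(tok_b, tok_a) for tok_a in tokens_a]
        best = max(range(len(scores)), key=scores.__getitem__)
        if scores[best] >= threshold:
            links[desc_b] = products_a[best]
    return links


# Hypothetical sample rows, not taken from retailerA.csv or retailerB.csv.
retailer_a = ["Acme Blender 500W Silver", "Acme Blender 700W Silver", "Zen Yoga Mat 6mm"]
retailer_b = ["acme silver blender 700w", "Trail Running Shoes 42"]

matches = link_records(retailer_a, retailer_b)
print(matches)
```

In this toy data the only difference between the two blenders is the wattage token, which is why numeric tokens are kept rather than stripped.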

Data Source

Repo Structure

entity_resolution_clustering.ipynb: Code that applies entity resolution and clustering.
exploration_data_analysis.ipynb: Exploratory analysis used to choose which columns to compare.
retailerA.csv: Product dataset from Retailer A.
retailerB.csv: Product dataset from Retailer B.

Key Features

Data processing: We chose a column that needs no cleaning. Its structure still matters: the numbers in the text are important when comparing records, so they must be preserved.
Vectorization: Text values are vectorized with TfidfVectorizer.
Clustering: We used HDBSCAN because, when the project runs as a pipeline, the model has to determine the number of clusters itself instead of receiving it as a parameter.
Automation: Every modeling step is wrapped in a function.

Installation and Usage

Clone the repository.
Install dependencies: pip install -r requirements.txt.
Run the Jupyter notebooks.

License

MIT
