In distributed work environments, it's common for multiple individuals to work on the same topic or dataset, resulting in the need to collate insights and inputs from various sources. Manually identifying and removing duplicate points from a corpus of sentences can be time-consuming and error-prone. To streamline this process, I developed a program that leverages Cohere embeddings to automatically identify and eliminate duplicate points among a collection of sentences.
This program calculates the semantic similarity (similarity in meaning) between every pair of sentences and reports it as a percentage. A score of 100% means the two sentences are treated as identical in meaning; lower scores indicate less overlap. Pairs with high scores can then be flagged as likely duplicates, which reduces the manual effort of aggregating content from several contributors.
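
The scoring step can be sketched as follows. This is a minimal illustration, assuming the similarity is cosine similarity between embedding vectors scaled to a percentage; the source does not state which metric the program uses. The names `cosine_similarity`, `similarity_matrix`, and `toy_embeddings` are hypothetical, and the short placeholder vectors stand in for real Cohere embeddings so the example runs without an API key.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def similarity_matrix(embeddings):
    """Pairwise cosine similarity as a percentage, rounded to 2 decimals."""
    return [
        [round(100 * cosine_similarity(u, v), 2) for v in embeddings]
        for u in embeddings
    ]


# Placeholder vectors (assumption): real sentence embeddings would be much longer.
toy_embeddings = [
    [1.0, 0.0, 0.0],
    [1.0, 1.0, 0.0],
    [0.0, 0.0, 2.0],
]

matrix = similarity_matrix(toy_embeddings)
for row in matrix:
    print(row)
```

The resulting matrix is symmetric with 100.0 on the diagonal, matching the shape of the tool's output. An all-zero vector would cause a division by zero here, so a real implementation would need to guard against that case.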
Features:
- Automatic identification of duplicate points among a set of sentences.
- Utilizes Cohere embeddings to measure semantic similarity.
- Outputs a similarity percentage to quantify the degree of similarity.
- Enhances productivity by reducing the need for manual duplicate removal.
- Easy integration into your existing workflow.
Usage:
- Visit the live demo.
- Upload your Excel file containing sentences in a column named 'Text'.
- Let the program calculate semantic similarity and generate a similarity matrix.
- Review the similarity percentages to identify and address duplicate points.
Input Sentences:
- "The quick brown fox jumps over the lazy dog."
- "A fast brown fox jumps over a lazy canine."
- "An agile fox leaps over the inactive dog."
Output Similarity Matrix (%):
| | Sentence 1 | Sentence 2 | Sentence 3 |
|---|---|---|---|
| Sentence 1 | 100.00 | 82.53 | 78.90 |
| Sentence 2 | 82.53 | 100.00 | 84.22 |
| Sentence 3 | 78.90 | 84.22 | 100.00 |
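
Reviewing the matrix for duplicates can be automated with a simple cutoff. The sketch below applies one to the example matrix above. The function name `flag_duplicates` and the 80% threshold are assumptions made for illustration; the source does not specify a cutoff, and a suitable value likely depends on the data.

```python
from itertools import combinations


def flag_duplicates(matrix, threshold):
    """Return (i, j, score) for each pair i < j whose score meets the threshold."""
    return [
        (i, j, matrix[i][j])
        for i, j in combinations(range(len(matrix)), 2)
        if matrix[i][j] >= threshold
    ]


# Values copied from the example similarity matrix.
example_matrix = [
    [100.00, 82.53, 78.90],
    [82.53, 100.00, 84.22],
    [78.90, 84.22, 100.00],
]

THRESHOLD = 80.0  # hypothetical cutoff, not taken from the project

flagged = flag_duplicates(example_matrix, THRESHOLD)
for i, j, score in flagged:
    print(f"Sentence {i + 1} and Sentence {j + 1}: {score:.2f}%")
```

With this cutoff, Sentences 1 and 2 and Sentences 2 and 3 are flagged, while Sentences 1 and 3 (78.90%) are not. Only the upper triangle is scanned because the matrix is symmetric and the diagonal always scores 100%.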
Contributions are welcome! If you find any issues or have suggestions, feel free to submit a pull request or create an issue.
This project is licensed under the MIT License.
Disclaimer: This project is for educational purposes and not intended for production use.