mikemwai/massive

Massive

This project processes the Amazon MASSIVE dataset. It generates language-specific Excel files, such as en-xx.xlsx, for multiple languages, and creates separate JSONL files for English (en), Swahili (sw), and German (de) containing the test, train, and dev data. It also generates a single JSON file that contains the translations from English to all languages, with id and utt, for the training sets. The project handles the dataset without recursive algorithms, to avoid potential memory and time complexity issues.
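The two core steps described above (splitting records into test, train, and dev sets, and pairing English utterances with their translations) could look roughly like the sketch below. It is not the code in this repository: the function names `split_by_partition` and `pair_translations` are hypothetical, the sample rows are invented, and it assumes each JSONL record carries `id`, `partition`, and `utt` fields. Both functions use plain loops and dictionary lookups, so no recursion is involved.

```python
import json
from collections import defaultdict


def split_by_partition(jsonl_lines):
    """Group records by their 'partition' field (assumed: train/dev/test)."""
    groups = defaultdict(list)
    for line in jsonl_lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        groups[record["partition"]].append(record)
    return dict(groups)


def pair_translations(en_records, xx_records):
    """Pair English and target-language utterances that share the same id."""
    xx_by_id = {r["id"]: r["utt"] for r in xx_records}
    return [
        {"id": r["id"], "en_utt": r["utt"], "xx_utt": xx_by_id[r["id"]]}
        for r in en_records
        if r["id"] in xx_by_id
    ]


# Invented sample rows, for illustration only.
en_lines = [
    '{"id": "0", "partition": "train", "utt": "hello"}',
    '{"id": "1", "partition": "test", "utt": "goodbye"}',
]
de_lines = [
    '{"id": "0", "partition": "train", "utt": "hallo"}',
    '{"id": "1", "partition": "test", "utt": "tschüss"}',
]

en_train = split_by_partition(en_lines)["train"]
de_train = split_by_partition(de_lines)["train"]
print(json.dumps(pair_translations(en_train, de_train), ensure_ascii=False))
```

Indexing the target-language records by id first keeps the pairing step linear in the number of records.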

Prerequisites

  • Python version 3.11.5
  • PyCharm version 2023.2.1

Installation

  1. Clone the repository to your local machine:
  git clone https://github.com/mikemwai/massive.git
  2. Navigate to the project directory and create a virtual environment from the command line:
  py -m venv myenv
  3. Activate the virtual environment:
  • On Windows:
  myenv\Scripts\activate
  • On Mac:
  source myenv/bin/activate
  4. Install the project dependencies in the virtual environment:
  pip install -r requirements.txt
  5. Extract the dataset folder into your project folder. Use WinRAR to extract.

Usage

Run the project from the IDE terminal:

   python main.py generate_excel_files separate_files train_translations
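The command above passes three task names in one invocation. The README does not show how main.py handles them, so the following is only a sketch of one way a script could run several named tasks in order. The `TASKS` table, the `run` function, and the placeholder task bodies are all hypothetical; only the three task names come from the command above.

```python
def generate_excel_files():
    # Placeholder body: the real task would write the en-xx.xlsx files.
    return "generate_excel_files done"


def separate_files():
    # Placeholder body: the real task would write the en, sw, and de JSONL files.
    return "separate_files done"


def train_translations():
    # Placeholder body: the real task would write the translations JSON file.
    return "train_translations done"


# Map each command-line task name to the function that performs it.
TASKS = {
    "generate_excel_files": generate_excel_files,
    "separate_files": separate_files,
    "train_translations": train_translations,
}


def run(task_names):
    """Run the named tasks in the order given, rejecting unknown names first."""
    unknown = [name for name in task_names if name not in TASKS]
    if unknown:
        raise ValueError(f"Unknown task(s): {', '.join(unknown)}")
    return [TASKS[name]() for name in task_names]


# In a script, the names would typically come from sys.argv[1:].
for message in run(["generate_excel_files", "separate_files", "train_translations"]):
    print(message)
```

Validating every name before running anything means a typo in the last argument does not leave the earlier tasks half finished.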

Contributions

If you'd like to contribute to this project:

  • Please fork the repository
  • Create a new branch for your changes
  • Submit a pull request

Contributions, bug reports, and feature requests are welcome!

Issues

If you have any issues with the project, feel free to open up an issue.

License

This project is licensed under the MIT License - see the LICENSE file for details.

About

Amazon massive dataset processing project.
