Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Added example notebook for translation with ct2 model. #262

Draft
wants to merge 2 commits into
base: main
Choose a base branch
from

Conversation

uahmed93
Copy link

@uahmed93 uahmed93 commented Sep 25, 2024

Description

This PR is adding a translation example to hindi languge via ct2 model.
Issue here
This example depends on CrossFit's PR

Checklist

  • I am familiar with the Contributing Guide.
  • New or Existing tests cover these changes.
  • The documentation is up to date with these changes.

@uahmed93 uahmed93 force-pushed the uahmed/ct2_translation_example_notebook branch 2 times, most recently from 1b7e864 to 5ecd4fe Compare September 27, 2024 07:06
Signed-off-by: Ahmed Umair <ahmedumair.1993@gmail.com>
@uahmed93 uahmed93 force-pushed the uahmed/ct2_translation_example_notebook branch from 5ecd4fe to cef1c34 Compare October 2, 2024 13:52
Copy link
Collaborator

@VibhuJawa VibhuJawa left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Left a initial review, thanks for working on this tutorial.

tutorials/translation/ct2_hindi_translation.ipynb Outdated Show resolved Hide resolved
Comment on lines 505 to 511
"/home/uahmed/Desktop/NeMo-Curator/.ndc/lib/python3.10/site-packages/sklearn/base.py:376: InconsistentVersionWarning: Trying to unpickle estimator LinearRegression from version 1.4.2 when using version 1.5.1. This might lead to breaking code or invalid results. Use at your own risk. For more info please refer to:\n",
"https://scikit-learn.org/stable/model_persistence.html#security-maintainability-limitations\n",
" warnings.warn(\n",
"GPU: 0, Part: 0: 0%| | 0/10 [05:33<?, ?it/s]1.74s/it]\n",
"GPU: 0, Part: 0: 100%|██████████| 10/10 [00:17<00:00, 1.77s/it]\n"
]
},
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Given its a tutorial, we should show the output that writing it out, maybe try doing both .

"source": [
"import os\n",
"\n",
"os.environ[\"DASK_DATAFRAME__QUERY_PLANNING\"] = \"False\"\n",
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can remove this, no need to add it now.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep, this should be removed.

tutorials/translation/ct2_hindi_translation.ipynb Outdated Show resolved Hide resolved
"metadata": {},
"source": [
"# Indic Translation\n",
"This notebook demonstrate an example use of nemo-curator for Indic language generation via translation from English language. This workflow is accelarated by CrossFit, a library that leverages intellegent batching and RAPIDS to accelerate the offline inference on large datasets. \n",
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would suggest mentioning that we support large scale translation using multi node multi gpu setup

"# Indic Translation\n",
"This notebook demonstrate an example use of nemo-curator for Indic language generation via translation from English language. This workflow is accelarated by CrossFit, a library that leverages intellegent batching and RAPIDS to accelerate the offline inference on large datasets. \n",
"This example uses ctransalte2 model from [here](https://indictrans2-public.objectstore.e2enetworks.net/it2_preprint_ckpts/en-indic-preprint.zip), taken from IndicTrans2 github repo, [here](https://github.com/AI4Bharat/IndicTrans2/tree/main?tab=readme-ov-file#multilingual-translation-models)"
]
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add a clear demarkation of import section here

tutorials/translation/ct2_hindi_translation.ipynb Outdated Show resolved Hide resolved
"cell_type": "markdown",
"metadata": {},
"source": [
"## Define output directory\n"
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
"## Define output directory\n"
"##### Define output directory\n"

]
},
{
"cell_type": "markdown",
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

tutorials/translation/ct2_hindi_translation.ipynb Outdated Show resolved Hide resolved
Signed-off-by: Ahmed Umair <ahmedumair.1993@gmail.com>
@sarahyurick sarahyurick added the documentation Improvements or additions to documentation label Oct 7, 2024
@uahmed93
Copy link
Author

uahmed93 commented Oct 8, 2024

Hi @VibhuJawa / @sarahyurick ,
can you please merge this MR ?

Copy link
Collaborator

@sarahyurick sarahyurick left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can leave a more detailed review for focusing on grammar nits, but for now can you please implement the code changes from my review, then rerun the entire notebook (executing from start to bottom and in order)?

"source": [
"import os\n",
"\n",
"os.environ[\"DASK_DATAFRAME__QUERY_PLANNING\"] = \"False\"\n",
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep, this should be removed.

"metadata": {},
"outputs": [],
"source": [
"@dataclass\n",
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about removing this entire cell and creating a new script in https://github.com/NVIDIA/NeMo-Curator/tree/main/nemo_curator/classifiers?

Copy link
Collaborator

@VibhuJawa VibhuJawa Oct 10, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We cant really have another classifier just yet, we have to have this as a tutorial as we dont want to support this globally just yet. We can probably create a utils file here to make this cleaner though.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah ok I see, using a utils file sounds good to me.

"metadata": {},
"outputs": [],
"source": [
"class IndicTranslation(DistributedDataClassifier):\n",
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same comment as above, why don't we move all of this logic into https://github.com/NVIDIA/NeMo-Curator/tree/main/nemo_curator/classifiers?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
documentation Improvements or additions to documentation
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants