Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improved HuggingFace Connectors #612

Draft
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

lordsoffallen
Copy link

Description

Current HF connectors are too limited so I end up creating my own most of the time.

Development notes

Similar code that I have been using in my projects.

Checklist

  • Opened this PR as a 'Draft Pull Request' if it is work-in-progress
  • Updated the documentation to reflect the code changes
  • Added a description of this change in the relevant RELEASE.md file
  • Added tests to cover my changes

Signed-off-by: ftopal <fazilbtopal@gmail.com>
@lordsoffallen
Copy link
Author

@merelcht

I have created this draft PR to start a discussion on some points. Let me know what you people think so I can update it accordingly. Right now, test, documentation is missing cuz I wanted to get the design first.

@merelcht
Copy link
Member

merelcht commented Mar 15, 2024

Thanks @lordsoffallen , I'll bring it up with the team 🙂 In the meantime could you maybe explain a bit more in what respect the other hugging face datasets are too limited?

@merelcht merelcht added the Community Issue/PR opened by the open-source community label Mar 15, 2024
@lordsoffallen
Copy link
Author

Thanks @lordsoffallen , I'll bring it up with the team 🙂 In the meantime could you maybe explain a bit more in what respect the other hugging face datasets are too limited?

Sure. Current ones don't support local load/save operation, in fact HFDataset don't support save at all so checkpointing datasets at different stages isn't doable as is in kedro. Other one which is transformer oriented is using pipeline only. Having model and tokenizer separately brings more functionality to user than pipeline itself.

I would say it would be good to think about the whole new pretrained model usecases in kedro. All of the connectors about the data and I am aware models somehow can be used in the same manner but it feels a bit more of hack. Perhaps we'd need something like kedro_models to put them all together? Bunch of tools to help manage models? Essentially when we work with huggingface models, a lot of configs are model related so organizing them in a catalog would be more kedro centric.

@merelcht
Copy link
Member

Thanks for sharing more context @lordsoffallen. I personally haven't worked with hugging face yet, but it does sound like kedro datasets aren't fully matching the models use case. Looping in @astrojuanlu who created the current hugging face dataset to hear his thoughts.

Copy link
Member

@astrojuanlu astrojuanlu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for this contribution @lordsoffallen and sorry it took me so much to get back! I have a couple of questions on the implementation and naming.

logger = logging.getLogger(__file__)


class HFTransformer(AbstractDataset):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What about HFAutoTransformerDataset?

(About ...Dataset, I know it's not technically a dataset and that there's potential confusion with Hugging Face Datasets, but I'm wondering if we should keep consistency here. @merelcht ?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point, I'd prefer to keep it consistent even this isn't technically a dataset. HFAutoTransformerDataset sounds like a good name.

Comment on lines +67 to +71
if self.save_to_disk:
logger.info("Saving to local disk.")
data.save_to_disk(self._filepath)

if self.save_to_hub:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this branching here is kind of unusual. As an alternative, kedro-mlflow provides 2 different datasets for this purpose, MlflowModelTrackingDataset and MlflowModelLocalFileSystemDataset. Do you think we should do the same here @merelcht ?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes looking at this it might be an idea to split this into a dataset that does loading & saving to disk and one that does it remotely.

@lordsoffallen
Copy link
Author

Hey @merelcht @astrojuanlu,

Sorry I also forgot about this for a while. Now getting back to it, I share my two cents here:

LLM workflows is tricky to use with kedro at the moment. Because I like the framework and i wanted to make it work with kedro. I believe suggestions are to make it fit into the current style but I believe for LLMs we need a different style than normal one. I think I will work on creating a extension kind of version as I don't see high prio for that on kedro roadmap atm. I will share the updates later on with the community in case anyone find it useful.

@astrojuanlu
Copy link
Member

@lordsoffallen Just yesterday I opened this discussion kedro-org/kedro#3979 I'd love to know what makes LLM workflows difficult in Kedro. Do you mind elaborating a bit more there?

@lordsoffallen
Copy link
Author

@lordsoffallen Just yesterday I opened this discussion kedro-org/kedro#3979 I'd love to know what makes LLM workflows difficult in Kedro. Do you mind elaborating a bit more there?

Sure, I'll share my workflow and how i image it there 👍

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Community Issue/PR opened by the open-source community
Projects
Status: No status
Development

Successfully merging this pull request may close these issues.

3 participants