-
Notifications
You must be signed in to change notification settings - Fork 87
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Improved HuggingFace Connectors #612
base: main
Are you sure you want to change the base?
Conversation
Signed-off-by: ftopal <fazilbtopal@gmail.com>
I have created this draft PR to start a discussion on some points. Let me know what you people think so I can update it accordingly. Right now, test, documentation is missing cuz I wanted to get the design first. |
Thanks @lordsoffallen , I'll bring it up with the team 🙂 In the meantime could you maybe explain a bit more in what respect the other hugging face datasets are too limited? |
Sure. Current ones don't support local load/save operation, in fact HFDataset don't support save at all so checkpointing datasets at different stages isn't doable as is in kedro. Other one which is transformer oriented is using pipeline only. Having model and tokenizer separately brings more functionality to user than pipeline itself. I would say it would be good to think about the whole new pretrained model usecases in kedro. All of the connectors about the data and I am aware models somehow can be used in the same manner but it feels a bit more of hack. Perhaps we'd need something like |
Thanks for sharing more context @lordsoffallen. I personally haven't worked with hugging face yet, but it does sound like kedro datasets aren't fully matching the models use case. Looping in @astrojuanlu who created the current hugging face dataset to hear his thoughts. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for this contribution @lordsoffallen and sorry it took me so much to get back! I have a couple of questions on the implementation and naming.
logger = logging.getLogger(__file__) | ||
|
||
|
||
class HFTransformer(AbstractDataset): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What about HFAutoTransformerDataset
?
(About ...Dataset
, I know it's not technically a dataset and that there's potential confusion with Hugging Face Datasets, but I'm wondering if we should keep consistency here. @merelcht ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point, I'd prefer to keep it consistent even this isn't technically a dataset. HFAutoTransformerDataset
sounds like a good name.
if self.save_to_disk: | ||
logger.info("Saving to local disk.") | ||
data.save_to_disk(self._filepath) | ||
|
||
if self.save_to_hub: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this branching here is kind of unusual. As an alternative, kedro-mlflow
provides 2 different datasets for this purpose, MlflowModelTrackingDataset
and MlflowModelLocalFileSystemDataset
. Do you think we should do the same here @merelcht ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes looking at this it might be an idea to split this into a dataset that does loading & saving to disk and one that does it remotely.
Hey @merelcht @astrojuanlu, Sorry I also forgot about this for a while. Now getting back to it, I share my two cents here: LLM workflows is tricky to use with kedro at the moment. Because I like the framework and i wanted to make it work with kedro. I believe suggestions are to make it fit into the current style but I believe for LLMs we need a different style than normal one. I think I will work on creating a extension kind of version as I don't see high prio for that on kedro roadmap atm. I will share the updates later on with the community in case anyone find it useful. |
@lordsoffallen Just yesterday I opened this discussion kedro-org/kedro#3979 I'd love to know what makes LLM workflows difficult in Kedro. Do you mind elaborating a bit more there? |
Sure, I'll share my workflow and how i image it there 👍 |
Description
Current HF connectors are too limited so I end up creating my own most of the time.
Development notes
Similar code that I have been using in my projects.
Checklist
RELEASE.md
file