[Question] How to create instruction datasets based on domain specific information. #21

Closed
interactivetech opened this issue Aug 16, 2023 · 2 comments

@interactivetech

Awesome work with airoboros! I am interested in creating datasets for instruction tuning based on domain-specific content (API documentation, Python codebases, PDFs). What would be the best way to provide documents and domain information to create instruction-tuning datasets? Have you seen other codebases or papers that achieve this?

@jondurbin
Owner

This isn't currently an option; see also the related issue #12.

I am, however, working on adding a version of this to this tool. It will be similar to what Meta did with Humpback - generate questions/instructions for which the response is the content you already have.

For example, if you provide a section of code, it would generate an instruction similar to "write a python script that does [x]". Your content is the target response, and the LLM produces the questions/instructions. It won't be perfect, and it will be somewhat limited in scope initially because of how difficult it is to properly segment raw data, but it will at least be something to start with.
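To make the idea concrete, here is a rough sketch of that flow. It is not airoboros code and not the planned implementation: the prompt wording, `segment`, `build_pairs`, and the `generate` callable are all hypothetical names made up for illustration, and the segmentation is deliberately naive (split on blank lines), which is exactly the hard part mentioned above.

```python
from typing import Callable, Dict, List

# Hypothetical prompt template (an assumption, not taken from airoboros).
PROMPT_TEMPLATE = (
    "Below is a piece of content. Write the single instruction or question "
    "for which this content would be the ideal response.\n\n"
    "Content:\n{content}\n\nInstruction:"
)


def segment(text: str, max_chars: int = 1500) -> List[str]:
    """Naively split text on blank lines and pack paragraphs into chunks.

    A paragraph longer than max_chars is kept whole as its own oversized chunk.
    """
    chunks: List[str] = []
    current = ""
    for para in text.split("\n\n"):
        para = para.strip("\n")  # keep indentation, drop stray newlines
        if not para.strip():
            continue
        candidate = f"{current}\n\n{para}" if current else para
        if len(candidate) <= max_chars or not current:
            current = candidate
        else:
            chunks.append(current)
            current = para
    if current:
        chunks.append(current)
    return chunks


def build_pairs(
    text: str,
    generate: Callable[[str], str],
    max_chars: int = 1500,
) -> List[Dict[str, str]]:
    """Ask the model for an instruction per chunk; the chunk is the response."""
    pairs: List[Dict[str, str]] = []
    for chunk in segment(text, max_chars):
        instruction = generate(PROMPT_TEMPLATE.format(content=chunk)).strip()
        if instruction:  # skip chunks where the model returned nothing usable
            pairs.append({"instruction": instruction, "response": chunk})
    return pairs


def fake_llm(prompt: str) -> str:
    """Stand-in for a real LLM call so the sketch runs offline."""
    return "Write a Python function that adds two numbers."


if __name__ == "__main__":
    source = "def add(a, b):\n    return a + b"
    for pair in build_pairs(source, fake_llm):
        print(pair)
```

In practice `fake_llm` would be swapped for whatever function wraps your model of choice, and the generated instructions would likely need filtering, since some chunks will not map cleanly to a single instruction.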

@interactivetech
Author

Thanks for the info! Will close the issue.
