[Question] How to create instruction datasets based on domain specific information. #21

interactivetech · 2023-08-16T16:07:41Z

Awesome work with airoboros! I am interested in creating instruction-tuning datasets from domain-specific content (API documentation, a Python codebase, PDFs). What would be the best way to provide documents and domain information to generate such datasets? Have you seen other codebases or papers that achieve this?

jondurbin · 2023-08-16T17:32:02Z

This isn't currently an option; see also the related issue #12.

I am, however, working on adding a version of this to the tool. It will be similar to what Meta did with Humpback: generate questions/instructions for which the response is the content you already have.

For example, if you provide a section of code, it would generate an instruction similar to "write a python script that does [x]". Your content is the target response, and the LLM produces the questions/instructions. It won't be perfect, and it will be somewhat limited in scope initially because of how difficult it is to properly segment raw data, but it will at least be something to start with.
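To make the idea concrete, here is a rough sketch of that flow. This is not airoboros code, and none of these names exist in the tool: `build_prompt`, `backtranslate`, and the prompt wording are hypothetical, and `fake_llm` is a stand-in for whatever model call you would actually use. It only shows the shape of the approach: each chunk of existing content becomes the response, and the model is asked to write the instruction for it.

```python
import json
from typing import Callable, Iterable

# Hypothetical prompt wording; tune this for your own model and domain.
PROMPT_TEMPLATE = (
    "Below is a piece of content. Write the instruction or question "
    "for which this content would be the ideal response.\n\n"
    "Content:\n{content}\n\nInstruction:"
)


def build_prompt(content: str) -> str:
    """Wrap one chunk of existing content in the instruction-generation prompt."""
    return PROMPT_TEMPLATE.format(content=content.strip())


def backtranslate(chunks: Iterable[str], llm: Callable[[str], str]) -> list[dict]:
    """Pair each non-empty chunk (the response) with a generated instruction."""
    records = []
    for chunk in chunks:
        if not chunk.strip():
            continue  # skip whitespace-only segments
        instruction = llm(build_prompt(chunk)).strip()
        records.append({"instruction": instruction, "response": chunk})
    return records


def fake_llm(prompt: str) -> str:
    """Placeholder for a real model call; always returns a canned instruction."""
    return "Write a Python function that adds two numbers."


# Chunks are assumed to be pre-segmented; segmentation is the hard part.
chunks = ["def add(a, b):\n    return a + b\n", "   "]
records = backtranslate(chunks, fake_llm)

for record in records:
    print(json.dumps(record))  # one JSON object per line (JSONL)
```

Swapping `fake_llm` for a real model call is the only change needed to try this on real content. The sketch assumes the input is already split into sensible chunks, which is exactly the segmentation problem mentioned above.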

interactivetech · 2023-08-16T17:57:08Z

Thanks for the info! Will close the issue.

interactivetech closed this as completed Aug 16, 2023
