Awesome work with airoboros! I am interested in creating datasets for instruction tuning based on domain-specific content (API documentation, Python codebases, PDFs). What would be the best way to provide documents and domain information to create instruction-tuned datasets? Have you seen other codebases or papers that achieve this?
This isn't currently an option; see also the related issue #12.
I am, however, working on adding a version of this to the tool. It will be similar to what Meta did with Humpback: generate questions/instructions for which the response is the content you already have.
For example, if you provide a section of code, it would generate an instruction similar to "write a python script that does [x]", so your content is the target response and the LLM produces the questions/instructions. It won't be perfect, and it will be somewhat limited in scope initially because raw data is difficult to segment properly, but it will at least be something to start with.
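To make the idea concrete, here is a rough sketch of that reverse-instruction flow. It is not the airoboros implementation: `PROMPT_TEMPLATE`, `split_into_chunks`, `build_pairs`, and the `generate` callable are all hypothetical names made up for illustration, and the blank-line chunking is deliberately naive. `generate` stands in for whatever LLM call you use (prompt string in, completion string out).

```python
from typing import Callable, Dict, List

# Hypothetical prompt: asks the model for an instruction that the content answers.
PROMPT_TEMPLATE = (
    "Below is a piece of content. Write one instruction or question "
    "for which this content would be an ideal response.\n\n"
    "Content:\n{content}\n\nInstruction:"
)


def split_into_chunks(text: str, max_chars: int = 2000) -> List[str]:
    """Naively segment raw text on blank lines, packing paragraphs up to max_chars.

    A single paragraph longer than max_chars is kept whole as its own chunk.
    """
    chunks: List[str] = []
    current = ""
    for paragraph in text.split("\n\n"):
        paragraph = paragraph.strip()
        if not paragraph:
            continue
        candidate = f"{current}\n\n{paragraph}" if current else paragraph
        if len(candidate) > max_chars and current:
            # Adding this paragraph would overflow, so close the current chunk.
            chunks.append(current)
            current = paragraph
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


def build_pairs(
    text: str,
    generate: Callable[[str], str],
    max_chars: int = 2000,
) -> List[Dict[str, str]]:
    """Turn existing content into instruction/response pairs.

    The content is the target response; `generate` (an assumed LLM wrapper)
    produces the instruction that should elicit it.
    """
    pairs: List[Dict[str, str]] = []
    for chunk in split_into_chunks(text, max_chars):
        instruction = generate(PROMPT_TEMPLATE.format(content=chunk)).strip()
        if instruction:  # skip chunks where the model returned nothing
            pairs.append({"instruction": instruction, "response": chunk})
    return pairs
```

You would call it as `build_pairs(raw_text, generate=my_llm_call)` and write the resulting list out as JSONL. The hard part mentioned above, segmenting raw data well, is exactly what `split_into_chunks` glosses over; code, API docs, and PDFs would each likely need their own splitter.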