
Requesting some flexibility in topic input #12

Open
jmoney7823956789378 opened this issue Jul 26, 2023 · 4 comments

Comments

@jmoney7823956789378

I've been playing around quite a bit with airoboros, but I've been struggling to achieve my goal due to the nature of my "dataset". Now that the GPT-4 API is publicly available to just about anyone, I'd like to run through my attempts again.
The dataset is a very large plaintext transcript of a textbook, with a lot of content that requires context from surrounding lines.
I've had SOME success just passing the text file as the topics-path, but because it only takes one "topic" per line, a lot of the information and context is missed.
This might be a niche thing that only I will ever use, but would it be possible to introduce a switch like --topic-grouping 5, which would group topic lines in batches of five and send each batch through the prompt to be processed?
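
For illustration, a minimal sketch of what that grouping could look like (the --topic-grouping flag doesn't exist in airoboros today, and the helper below is purely hypothetical):

```python
# Hypothetical sketch of a --topic-grouping option: read the topics file and
# batch every N consecutive lines into one block before building prompts.
def group_topics(path, group_size=5):
    """Read one topic per line and join every `group_size` lines into one block."""
    with open(path, encoding="utf-8") as f:
        lines = [line.strip() for line in f if line.strip()]
    return [
        "\n".join(lines[i:i + group_size])
        for i in range(0, len(lines), group_size)
    ]

# Each grouped block would then go through the existing prompt as a single
# "topic", so context that spans adjacent lines is kept together.
topics = group_topics("textbook.txt", group_size=5)
```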

@jondurbin
Owner

I'm curious, what is the goal/use case for that text content? I ask because it may be something that would be better as a new "instructor", rather than re-using the topic list.

@jmoney7823956789378
Author

> I'm curious, what is the goal/use case for that text content? I ask because it may be something that would be better as a new "instructor", rather than re-using the topic list.

In the optimal scenario, large portions of my textbook(s) are ingested at once and, rather than "create a topic based on this", GPT-4 would be asked to generate the Instruction+Response pair directly from that text.

Asking it to generate instructions and then following up by asking it to answer its own instructions is like a short game of telephone, and it causes some of the fine details in this textbook to be lost.

I can't share the intended training content publicly, but if you're interested, I can find a way to get it to you by email.

@jondurbin
Owner

> In the optimal scenario, large portions of my textbook(s) are ingested at once and, rather than "create a topic based on this", GPT-4 would be asked to generate the Instruction+Response pair directly from that text.

So you want to send a snippet of text, and have it generate a series of questions about that text, then produce the answers based on the text?

@jmoney7823956789378
Author

jmoney7823956789378 commented Jul 27, 2023

> So you want to send a snippet of text, and have it generate a series of questions about that text, then produce the answers based on the text?

Perhaps it would require something like the steps below (sketched in code afterward):

  1. Feed a block of text as input
  2. GPT generates an instruction based on the given text
  3. GPT generates a response based on the given text + the instruction from step 2

Simply due to the nature of the data, a lot of the content covered is unlikely to be answerable solely from the knowledge GPT-4 already has available.
This method, however, would use many more tokens than others.
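
A minimal sketch of that two-pass flow, assuming a generic complete() wrapper around whatever GPT-4 API call ends up being used (the function and the prompt wording here are just placeholders):

```python
# Hypothetical two-pass flow: first ask for an instruction grounded in the source
# text, then ask for the answer with the same text still in context.
def complete(prompt):
    # Stand-in for an actual GPT-4 chat-completion call.
    raise NotImplementedError

def block_to_pair(text_block):
    instruction = complete(
        "Write one question that can only be answered using the text below.\n\n"
        f"TEXT:\n{text_block}"
    )
    response = complete(
        "Answer the question using only the provided text.\n\n"
        f"TEXT:\n{text_block}\n\nQUESTION:\n{instruction}"
    )
    return {"instruction": instruction, "response": response}
```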

I also happen to have multiple-choice test questions with explanations that could be used as well. These could simply be tossed in and formatted without much transformation.
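
For example, assuming the test bank can be parsed into question/choices/answer/explanation fields (those names are just an assumption about the layout), the conversion could be as simple as:

```python
import json

# Hypothetical conversion of one multiple-choice question + explanation into an
# instruction/response pair; the field names are assumptions about the data layout.
def mcq_to_pair(question, choices, answer, explanation):
    instruction = question + "\n" + "\n".join(
        f"{letter}. {choice}" for letter, choice in zip("ABCD", choices)
    )
    response = f"The correct answer is {answer}. {explanation}"
    return json.dumps({"instruction": instruction, "response": response})
```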
