
Requesting some flexibility in topic input #12

Open
jmoney7823956789378 opened this issue Jul 26, 2023 · 4 comments

Comments

@jmoney7823956789378

I've been playing around quite a bit with airoboros, but I've been struggling to achieve my goal due to the nature of my "dataset". Now that the GPT-4 API is publicly available to just about anyone, I'd like to run through my attempts again.
The dataset is a very large plaintext transcript of a textbook, with a lot of content that requires context from surrounding lines.
I've had SOME success just passing the text file as the topics-path, but because it only takes one "topic" per line, a lot of the information and context is missed.
This might be a niche thing that only I will ever use, but would it be possible to introduce a switch like --topic-grouping 5, which would group topic lines in batches of five and send each batch through the prompt to be processed?
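
For illustration, a minimal sketch of what that grouping could look like (the --topic-grouping flag doesn't exist in airoboros today, and the helper below is purely hypothetical):

```python
# Hypothetical sketch of a --topic-grouping option: read the topics file and
# batch every N consecutive lines into one block before building prompts.
def group_topics(path, group_size=5):
    """Read one topic per line and join every `group_size` lines into one block."""
    with open(path, encoding="utf-8") as f:
        lines = [line.strip() for line in f if line.strip()]
    return [
        "\n".join(lines[i:i + group_size])
        for i in range(0, len(lines), group_size)
    ]

# Each grouped block would then go through the existing prompt as a single
# "topic", so context that spans adjacent lines is kept together.
topics = group_topics("textbook.txt", group_size=5)
```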

@jondurbin
Owner

I'm curious, what is the goal/use case for that text content? I ask because it may be something that would be better as a new "instructor", rather than re-using the topic list.

@jmoney7823956789378
Author

> I'm curious, what is the goal/use case for that text content? I ask because it may be something that would be better as a new "instructor", rather than re-using the topic list.

In the optimal scenario, large portions of my textbook(s) are ingested at once and, rather than "create a topic based on this", GPT-4 would be asked to generate the Instruction+Response pair directly from that text.

Asking it to generate instructions and then following up by asking it to answer its own instructions is like a short game of telephone, and it causes some of the fine details in this textbook to be lost.

I can't share the intended training content publicly, but if you're interested, I can find a way to get it to you by email.

@jondurbin
Owner

> In the optimal scenario, large portions of my textbook(s) are ingested at once and, rather than "create a topic based on this", GPT-4 would be asked to generate the Instruction+Response pair directly from that text.

So you want to send a snippet of text, and have it generate a series of questions about that text, then produce the answers based on the text?

@jmoney7823956789378
Author

jmoney7823956789378 commented Jul 27, 2023

> So you want to send a snippet of text, and have it generate a series of questions about that text, then produce the answers based on the text?

Perhaps it would require something like the steps below (sketched in code afterward):

  1. Feed a block of text as input
  2. GPT generates an instruction based on the given text
  3. GPT generates a response based on the given text + the instruction from step 2

Simply due to the nature of the data, a lot of the content covered is unlikely to be answerable solely from the knowledge GPT-4 already has available.
This method, however, would use many more tokens than others.
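
A minimal sketch of that two-pass flow, assuming a generic complete() wrapper around whatever GPT-4 API call ends up being used (the function and the prompt wording here are just placeholders):

```python
# Hypothetical two-pass flow: first ask for an instruction grounded in the source
# text, then ask for the answer with the same text still in context.
def complete(prompt):
    # Stand-in for an actual GPT-4 chat-completion call.
    raise NotImplementedError

def block_to_pair(text_block):
    instruction = complete(
        "Write one question that can only be answered using the text below.\n\n"
        f"TEXT:\n{text_block}"
    )
    response = complete(
        "Answer the question using only the provided text.\n\n"
        f"TEXT:\n{text_block}\n\nQUESTION:\n{instruction}"
    )
    return {"instruction": instruction, "response": response}
```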

I also happen to have multiple-choice test questions with explanations that could be used as well. These could simply be tossed in and formatted without much transformation.
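
For example, assuming the test bank can be parsed into question/choices/answer/explanation fields (those names are just an assumption about the layout), the conversion could be as simple as:

```python
import json

# Hypothetical conversion of one multiple-choice question + explanation into an
# instruction/response pair; the field names are assumptions about the data layout.
def mcq_to_pair(question, choices, answer, explanation):
    instruction = question + "\n" + "\n".join(
        f"{letter}. {choice}" for letter, choice in zip("ABCD", choices)
    )
    response = f"The correct answer is {answer}. {explanation}"
    return json.dumps({"instruction": instruction, "response": response})
```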
