Skip to content
This repository has been archived by the owner on Nov 14, 2023. It is now read-only.

Keep fastq id #2

Open
wants to merge 4 commits into
base: master
Choose a base branch
from
Open

Conversation

ilveroluca
Copy link
Contributor

I wanted to propose this change to the other developers. There are a couple of little problems that emerge when using FastqLoader and FastqStorer to manipulate fastq files whose id lines don't follow the Illumina format:

  1. there's no way to retrieve the original id line from the fastq records being read;
  2. there's no way to write an arbitrary fastq id line either.

An example where this an issue in practice, I wanted to convert base quality values from Illumina to Sanger encoding in a few gigabytes of fastq files provided by some sequencing center. The id lines were not in the "structured" Illumina format so I had no way to manipulate the data while keeping the same read ids. With this patch that's now possible.

For the most part, this change is backwards compatible. It adds a new field "id" to the tuples generated by FastqLoader, which can be easily ignored. There is one small thing though that might cause some surprises: if the "id" field exists in the tuple passed to FastqStorer, it will be used as the id line rather than constructing one from the meta data ("id" will be passed to FastqOutputFormat as the key, rather than passing null). I suspect this isn't a big issue, but I wanted to get some feedback before committing.

@AndreSchumacher
Copy link
Member

@ridvandongelci did you have a chance to look at this?

@ridvandongelci
Copy link
Contributor

Yes, It is ok for me. I was wondering what do you think.

@AndreSchumacher
Copy link
Member

It looks good to me except that previously we had Fastq and Qseq loader to use the same schema which simplified the manual (e.g., 5.2.3 FastqLoader and QseqLoader).

@ilveroluca any chance we could keep both loader/storer pairs similar by adding this to Qseq, too?

@ilveroluca
Copy link
Contributor Author

Ah...I didn't think of that. I could change the schemas used by QseqLoader
and QseqStorer. The question is, what should "id" mean for the qseq
format? It doesn't exist there.

An option would be to have QseqLoader create an artificial id by joining
the coordinate values (the first columns, up to the read number).
QseqStorer on the other hand would have to ignore the value and throw it
out.

I don't know...it feels like a wart. What if we argued that people should
be projecting their data anyways? :-)

On 24 July 2014 08:10, Andre Schumacher notifications@github.com wrote:

It looks good to me except that previously we had Fastq and Qseq loader to
use the same schema which simplified the manual (e.g., 5.2.3 FastqLoader
and QseqLoader).

@ilveroluca https://github.com/ilveroluca any chance we could keep both
loader/storer pairs similar by adding this to Qseq, too?


Reply to this email directly or view it on GitHub
#2 (comment).

@AndreSchumacher
Copy link
Member

OK, let's not then shoehorn this onto the Qseq if there is not ID. @ilveroluca if you could just update doc/seqpig_reference.tex to break out the Qseq loader/storer part and copy-paste that to its own section so that the manual stays accurate. Thanks!!

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants