Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve performances by having one assay by level #70

Closed
cpommier opened this issue Dec 2, 2020 · 32 comments · May be fixed by #71
Closed

Improve performances by having one assay by level #70

cpommier opened this issue Dec 2, 2020 · 32 comments · May be fixed by #71
Assignees

Comments

@cpommier
Copy link
Member

cpommier commented Dec 2, 2020

The current bottelneck is the number of assays. There is curently one assy per observationunit.
MIAPPE allows to have one assay per level.
By handling only one observation unit per level in brapi_to_isa line 95 : def create_study_sample_and_assaythe performances are dramatically improved.
But, the germplasm list isn't present in the study file anymore.

What we don't understand, is that the germplasms should be handled by the following piece of code, line 534:

                for germ in germplasms:
                    # Creating corresponding ISA biosources with is Creating isa characteristics from germplasm attributes.
                    # ------------------------------------------------------
                    source = Source(name=germ['germplasmName'], characteristics=converter.create_germplasm_chars(germ))
                    
                    if germ['germplasmDbId'] not in germplasminfo:
                        germplasminfo[germ['germplasmDbId']] = [germ['accessionNumber']]

                    # Associating ISA sources to ISA isa_study object
                    isa_study.sources.append(source)

What are we suppose to do to ensure that the germplams end in the right study charachteristics ?

@cpommier
Copy link
Member Author

cpommier commented Dec 2, 2020

@bedroesb and @proccaserra , what is your opinion on this ?

@bedroesb
Copy link
Member

bedroesb commented Dec 3, 2020

@cpommier Could you send the changes you propose in def create_study_sample_and_assay so I can have a look how to fix it? Because to be honest if I look now at the code it is difficult to see immediately why certain things were done (in a hacky way) :)

@proccaserra
Copy link
Collaborator

proccaserra commented Dec 7, 2020

@cpommier as far as I remember (It is already a long time ago), we had the discussion about the which 'observation level' to support from BRAPI to ISA. So we need to go back to the mapping between BRAPI and ISA entities.
I am confused by the discrepancy in Line numbers you reference, but I guess we are talking about the code which tries to create ISA Source and Sample based on BRAPI information obtained from GermplasmID associated with ObservationUnits , and then do a lookup on the characteristics (invoking create_germplasm_chars(germ) ).
It seems the refactoring now fails to retrieve the information.
But a call would be best to discuss this because too much time has past since I worked on this (I am not working with BRAPI every day and my memory needs to be jogged to get the BRAPI model back!)

@cpommier
Copy link
Member Author

cpommier commented Dec 7, 2020

Thanks @bedroesb and @proccaserra !
The changes I have in mind are not implemented yet, I am still cautious, because they might violate the isatools model. Therefore I am sorry but there is nothing @erikkimmel or me could send you :(

A quick call might indeed be better, I'll send you both an email and we will see what we can do, thanks!

Curently, the code works, but is not too slow.
The curent implementation pseudo code is the following

for germ in  germplasm:
    study.source.add(germ.characteristics)

create_study_sample_and_assay (study, observation_unit)

def create_study_sample_and_assay(study, observation_unit):
  for obs_unit in  observation_unit:
      study.assays.add(create_sample_from_obs_unit (obs_unit))

I hope I am not oversimplyfing things.
The problem is that only the sources that are attached to an assay are added in the study file. I would like to dump all the sources even if they are not explicitely used in an assay. Is that possible @proccaserra or did I misunderstood isatools logic ?
What I would like to do is more in the line of :

for germ in  germplasm:
    study.source.add(germ.characteristics)

create_study_sample_and_assay (study, observation_unit)

def create_study_sample_and_assay(study, observation_units):

     # the following shoudl'nt be attached to any specific source. Or more precisely, it should be attached to all of them.
      study.assays.add(create_one_assay_per_level) # the level is plant, micro plot, block, etc...
      create_data_file(observation_units)

Does that make sense ?

@proccaserra
Copy link
Collaborator

@cpommier, I ran a quick test on a notebook with the isa-api. If an isa.Source is declared but not used in any protocol to create another materials, even though the isa.Source and associated to a Study, the serialisation to ISA-tab does not write that source. The ISA-JSON serialisation is fine shows the unused Source. So it seems we need to look at the ISA-Tab serialisation code. @djcomlab @Zigur

@cpommier
Copy link
Member Author

cpommier commented Dec 7, 2020

Thanks a lot @proccaserra

@djcomlab
Copy link

djcomlab commented Dec 7, 2020

So this is kind of expected behaviour and down to design decisions in ISA API.

The ISA-Tab tables serialization focuses on describing the experimental workflow, and if there's a source declared that is not used in a workflow, then the assumption is has been that it has not used in the experiment and therefore not rendered in a table.

ISA-JSON, however, has no such constraints as the data model has the structures to cleanly hold sources independent of the experimental workflow.

@cpommier
Copy link
Member Author

cpommier commented Dec 8, 2020

Thanks for the clarification @djcomlab . I understand the logic.
Would it be possible to trigger the ISA-JSON beaviour in ISA-Tab dumps using a dedicated option ?

@djcomlab
Copy link

djcomlab commented Dec 8, 2020

No, I don't think this would work as things are at the moment.

The problem is that in ISA-Tab there is no separate "container" for material nodes (Source, Sample, other stuff) to exist independently from the experimental workflows.

It is of course possible to just add sources into a study table file with no linked processes and derived materials or data etc. but in ISA API the design assumption has been that this would not be correct, and I suspect this raises errors at validation time (disclaimer: but I have not tried it!).

@cpommier
Copy link
Member Author

cpommier commented Dec 8, 2020

Ok, but would it be possible to attach all the sources to a single assay ?

@cpommier
Copy link
Member Author

cpommier commented Dec 8, 2020

In https://github.com/elixir-europe/plant-brapi-to-isa/blob/master/brapi_to_isa.py lines 103 and following, we create one sample attached to one source, that gets attached to the study line 154

 isa_study.samples.append(this_isa_sample)

Then it get attached to a dedicated assay again line 168: isa_study.assays[i].samples.append(this_isa_sample)

If we have only ` isa_study.samples.append(this_isa_sample) and create a single assay, woud that work ? @erikkimmel, what do you think ?

@djcomlab
Copy link

djcomlab commented Dec 8, 2020

For the sources and samples to appear in the tables they must be associated with a process.

e.g. at the study-level, source -> process -> sample must be described for it to appear in the study table.

Similarly to the sources in the sources container in the model and ISA-JSON not coming out in the ISA-Tab study tables, samples must also be part of an experimental workflow (i.e. part of a process sequence) to appear in the assay table files. @proccaserra can probably elaborate on this.

@cpommier
Copy link
Member Author

cpommier commented Dec 8, 2020

Thanks, that clarifies the choices that were made

@proccaserra
Copy link
Collaborator

@cpommier so I ran a few more test this morning, writing up variations of ISA objects and now confirm that:

  • unused ISA.Sources are not written to ISA-Tab but are seriliazed to ISA-JSON (as mentioned yesterday).
  • an ISA-Tab 'manually/externally' augmented with unused ISA.Sources can be validated. to be precise, the validation process completes without returning errors, just throws a couple of warnings.
  • while the above is possible, the value would be limited since any read in memory and serialisation back to tab would lose the 'unused' Sources.
  • trying to build a dummy assay without linking processes will fail to write the ISA assay table since the algorithm can not find paths (this is the expected behaviour, and also why no changes were made as it would require some significant changes (afaict).this would rule out your suggestion @cpommier

I'll document all that in a jupyter-notebook

@proccaserra
Copy link
Collaborator

@cpommier a suggestion: if brapi2isa code can identify 'unused Sources', then a possible solution would be to create "dummy" Samples so the ISA-Tab serialisation dumps the ISA.Source Information while clearly marking the ISA.Sample as "dummies". This could be done using a special string (e.g. <NOT_USED> string as value for the Sample Name), complemented by the addition of an ISA.Comment element qualifying said ISA.Sample object to make it clear to human agent.
Of course, this is 'sugar' and other tools supporting the format wouldn't know how to deal with that but it would possibly address your needs.

@djcomlab
Copy link

A question:

Are the source->sample, and sample->data workflows basically "flat" in all cases here?

From what I can see in the code it looks like there is no branching or pooling. The ISA API's ability to deal with complex graphs is what slows it down.

@proccaserra
Copy link
Collaborator

@djcomlab, based on the set of converted brapi studies, the graph complexity is very low. The amount of information in the ISA assay table is actually very low.

@djcomlab
Copy link

djcomlab commented Dec 16, 2020

@proccaserra That's what I figured. I could probably put together a fast custom ISA-Tab writer just for this, until we address the overarching ISA-API performance. The idea being that ISA API is more like the Swiss Army knife that we can use to do many things with, but not necessarily all of these things optimally in all cases, but when we want to do one specific kind of task on a larger scale, we need to craft something optimised the task.

@cpommier
Copy link
Member Author

@proccaserra @djcomlab , thanks! We had the same idea @erikkimmel and I and even tried an implementation, but it prooved to be too much work for us since we don't know the isatool library as well as you.
If you want to do a fast custom ISA-Tab writer we would be very happy to test it, either on the brapi2isa master branch or on our tentative optimisation branch (https://github.com/elixir-europe/plant-brapi-to-isa/tree/test/single_assay_by_level). In parralel, we could try to review that branch with @bedroesb . What do you think, should we have a chat before begining ?

Anyway, thanks a lot for proposing this optimised writer, it would be highly usefull!

@erikkimmel
Copy link
Collaborator

Good morning all,
With the code we modified with @cpommier on BrAPI2ISA in this branch, I ran some performance tests.
On a medium-sized dataset, I obtained a gain of around 25% (4h -> 3h).
Have you had time to look for the possible writing of a custom ISA dumper?
Thank you in advance for your answers.
Cheers

@cpommier
Copy link
Member Author

cpommier commented Mar 5, 2021

We just reviewed the code and here is the current logic and todo list:

  • For each study there is one assay per level, ie datafile. For instance, one assay for plant level measures and one for micro plot level measures.
  • Check sufficient performances improvement with the above modifications.
  • Add the treatment columns in the datafile, this information is currently lost. As well as the coordinates.
  • Make a decision on the sample in the assay file. We need a pseudo sample for representing each of the levels. This might be a blocker.
  • Finalize the sample naming in the assay file.

@erikkimmel
Copy link
Collaborator

Hi,
Here are the command lines I used to run my tests. This could be usefull if you want to run them on your own side. I run each command one time on the BrAPI2ISA master branch and on test/single_assay_by_level branch the second time.

  • very small dataset
python3 ./plant-brapi-to-isa/brapi_to_isa.py -J -V -e https://urgi.versailles.inra.fr/faidare/brapi/v1/ -t dXJuOlVSR0kvdHJpYWwvMjM=

# on master
real	1m45.905s
user	1m38.270s
sys	0m0.927s

# on test/single_assay_by_level
real	1m30.787s
user	1m22.485s
sys	0m0.878s
  • bigger dataset
python3 ./plant-brapi-to-isa/brapi_to_isa.py -J -V -e https://urgi.versailles.inra.fr/faidare/brapi/v1/ -t aHR0cHM6Ly9kb2kub3JnLzEwLjE1NDU0L0lBU1NUTg==

# on master
real    276m49,662s                                                                                                                                                                                                                            user    267m20,711s                                                                                                                                                                                                                            sys     0m22,956s

# on test/single_assay_by_level
real    186m11,643s                                                                                                                                                                                                                            user    177m29,317s                                                                                                                                                                                                                            sys     0m7,336s

Regards

@bedroesb
Copy link
Member

bedroesb commented Mar 5, 2021

@erikkimmel which isa-tools version did you use?

@erikkimmel
Copy link
Collaborator

@bedroesb :

$ pip3 freeze | grep isatools
isatools==0.11.0

@bedroesb
Copy link
Member

bedroesb commented Mar 5, 2021

with isa-tools 12 i hope to finally solve #62 and to have improved isa json dumping, lets see! Do we after testing on the VIB dataset merge the test/single_assay_by_level branch into master ? or do you want to work on the other points first? @erikkimmel

@cpommier
Copy link
Member Author

cpommier commented Mar 5, 2021

the changes proposed in test/single_assay_by_level could impove performances but have unwanted side effects. We should I think first check the impact of isatools 12, then asses the impact of this branch on metadata. It is still work in progress and there might be undesirable side effects. We can cope with it as long as there is a significative speed imporvment between master and test/single_assay_by_level. If that's not the case, the fast dumper suggested by @djcomlab might be more advisable.
Thanks for your interest and help/contribution :)

@erikkimmel
Copy link
Collaborator

@bedroesb, @cpommier I started to re-run my tests with the new 0.12.0 version of isatools. I'll keep you informed.

@erikkimmel
Copy link
Collaborator

erikkimmel commented Mar 16, 2021

@bedroesb, @cpommier here are the results of my tests using the 0.12.0 version of isatools:
I used the following command:

python3 ./plant-brapi-to-isa/brapi_to_isa.py -J -V -e https://urgi.versailles.inra.fr/faidare/brapi/v1/ -t aHR0cHM6Ly9kb2kub3JnLzEwLjE1NDU0L0lBU1NUTg==
  • on BrAPI2ISA master branch:
real    191m50,114s
user    183m22,339s
sys     0m7,436s
  • on test/single_assay_by_level branch:
real    143m33,114s
user    135m31,186s
sys     0m4,507s

Cheers

@cpommier
Copy link
Member Author

Thanks Erik. isatools 0.12 is definitely an improvement. The single assay by level approach can take advantage of it, but 143m is still rather long. This is only a 3 year, 12 locations, 300 samples dataset. Extract it in a few minutes, without validation, would be really a great thing.
Considering the situation, reviving the fast writer hypothesis, even if it is not that easy to write, could be the most profitable approach. What do you think @proccaserra @djcomlab ?

@erikkimmel
Copy link
Collaborator

I ran the tests on the master branch of BrAPI2ISA with the 0.13-rc2 version of isatools:

time python3.8 ./plant-brapi-to-isa/brapi_to_isa.py -J -V -e https://urgi.versailles.inra.fr/faidare/brapi/v1/ -t aHR0cHM6Ly9kb2kub3JnLzEwLjE1NDU0L0lBU1NUTg==
[...]
real    8m33,605s
user    1m3,299s
sys     0m2,000s

@bedroesb
Copy link
Member

bedroesb commented Oct 5, 2021

this is such great news! Thanks for testing

@cpommier
Copy link
Member Author

The one assay per level seems to be a bad approach. Performances improvement is still important, but probably with other approach, using latest isatools mainly.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants