Improve performances by having one assay by level #70

cpommier · 2020-12-02T10:24:26Z

The current bottelneck is the number of assays. There is curently one assy per observationunit.
MIAPPE allows to have one assay per level.
By handling only one observation unit per level in brapi_to_isa line 95 : def create_study_sample_and_assaythe performances are dramatically improved.
But, the germplasm list isn't present in the study file anymore.

What we don't understand, is that the germplasms should be handled by the following piece of code, line 534:

                for germ in germplasms:
                    # Creating corresponding ISA biosources with is Creating isa characteristics from germplasm attributes.
                    # ------------------------------------------------------
                    source = Source(name=germ['germplasmName'], characteristics=converter.create_germplasm_chars(germ))
                    
                    if germ['germplasmDbId'] not in germplasminfo:
                        germplasminfo[germ['germplasmDbId']] = [germ['accessionNumber']]

                    # Associating ISA sources to ISA isa_study object
                    isa_study.sources.append(source)

What are we suppose to do to ensure that the germplams end in the right study charachteristics ?

The text was updated successfully, but these errors were encountered:

cpommier · 2020-12-02T10:24:57Z

@bedroesb and @proccaserra , what is your opinion on this ?

bedroesb · 2020-12-03T12:45:29Z

@cpommier Could you send the changes you propose in def create_study_sample_and_assay so I can have a look how to fix it? Because to be honest if I look now at the code it is difficult to see immediately why certain things were done (in a hacky way) :)

proccaserra · 2020-12-07T15:32:07Z

@cpommier as far as I remember (It is already a long time ago), we had the discussion about the which 'observation level' to support from BRAPI to ISA. So we need to go back to the mapping between BRAPI and ISA entities.
I am confused by the discrepancy in Line numbers you reference, but I guess we are talking about the code which tries to create ISA Source and Sample based on BRAPI information obtained from GermplasmID associated with ObservationUnits , and then do a lookup on the characteristics (invoking create_germplasm_chars(germ) ).
It seems the refactoring now fails to retrieve the information.
But a call would be best to discuss this because too much time has past since I worked on this (I am not working with BRAPI every day and my memory needs to be jogged to get the BRAPI model back!)

cpommier · 2020-12-07T17:12:22Z

Thanks @bedroesb and @proccaserra !
The changes I have in mind are not implemented yet, I am still cautious, because they might violate the isatools model. Therefore I am sorry but there is nothing @erikkimmel or me could send you :(

A quick call might indeed be better, I'll send you both an email and we will see what we can do, thanks!

Curently, the code works, but is not too slow.
The curent implementation pseudo code is the following

for germ in  germplasm:
    study.source.add(germ.characteristics)

create_study_sample_and_assay (study, observation_unit)

def create_study_sample_and_assay(study, observation_unit):
  for obs_unit in  observation_unit:
      study.assays.add(create_sample_from_obs_unit (obs_unit))

I hope I am not oversimplyfing things.
The problem is that only the sources that are attached to an assay are added in the study file. I would like to dump all the sources even if they are not explicitely used in an assay. Is that possible @proccaserra or did I misunderstood isatools logic ?
What I would like to do is more in the line of :

for germ in  germplasm:
    study.source.add(germ.characteristics)

create_study_sample_and_assay (study, observation_unit)

def create_study_sample_and_assay(study, observation_units):

     # the following shoudl'nt be attached to any specific source. Or more precisely, it should be attached to all of them.
      study.assays.add(create_one_assay_per_level) # the level is plant, micro plot, block, etc...
      create_data_file(observation_units)

Does that make sense ?

proccaserra · 2020-12-07T21:19:43Z

@cpommier, I ran a quick test on a notebook with the isa-api. If an isa.Source is declared but not used in any protocol to create another materials, even though the isa.Source and associated to a Study, the serialisation to ISA-tab does not write that source. The ISA-JSON serialisation is fine shows the unused Source. So it seems we need to look at the ISA-Tab serialisation code. @djcomlab @Zigur

cpommier · 2020-12-07T21:24:33Z

Thanks a lot @proccaserra

djcomlab · 2020-12-07T21:29:20Z

So this is kind of expected behaviour and down to design decisions in ISA API.

The ISA-Tab tables serialization focuses on describing the experimental workflow, and if there's a source declared that is not used in a workflow, then the assumption is has been that it has not used in the experiment and therefore not rendered in a table.

ISA-JSON, however, has no such constraints as the data model has the structures to cleanly hold sources independent of the experimental workflow.

cpommier · 2020-12-08T09:34:24Z

Thanks for the clarification @djcomlab . I understand the logic.
Would it be possible to trigger the ISA-JSON beaviour in ISA-Tab dumps using a dedicated option ?

djcomlab · 2020-12-08T09:48:45Z

No, I don't think this would work as things are at the moment.

The problem is that in ISA-Tab there is no separate "container" for material nodes (Source, Sample, other stuff) to exist independently from the experimental workflows.

It is of course possible to just add sources into a study table file with no linked processes and derived materials or data etc. but in ISA API the design assumption has been that this would not be correct, and I suspect this raises errors at validation time (disclaimer: but I have not tried it!).

cpommier · 2020-12-08T11:17:37Z

Ok, but would it be possible to attach all the sources to a single assay ?

cpommier · 2020-12-08T11:24:36Z

In https://github.com/elixir-europe/plant-brapi-to-isa/blob/master/brapi_to_isa.py lines 103 and following, we create one sample attached to one source, that gets attached to the study line 154

 isa_study.samples.append(this_isa_sample)

Then it get attached to a dedicated assay again line 168: isa_study.assays[i].samples.append(this_isa_sample)

If we have only ` isa_study.samples.append(this_isa_sample) and create a single assay, woud that work ? @erikkimmel, what do you think ?

djcomlab · 2020-12-08T11:35:17Z

For the sources and samples to appear in the tables they must be associated with a process.

e.g. at the study-level, source -> process -> sample must be described for it to appear in the study table.

Similarly to the sources in the sources container in the model and ISA-JSON not coming out in the ISA-Tab study tables, samples must also be part of an experimental workflow (i.e. part of a process sequence) to appear in the assay table files. @proccaserra can probably elaborate on this.

cpommier · 2020-12-08T11:41:38Z

Thanks, that clarifies the choices that were made

proccaserra · 2020-12-08T13:30:32Z

@cpommier so I ran a few more test this morning, writing up variations of ISA objects and now confirm that:

unused ISA.Sources are not written to ISA-Tab but are seriliazed to ISA-JSON (as mentioned yesterday).
an ISA-Tab 'manually/externally' augmented with unused ISA.Sources can be validated. to be precise, the validation process completes without returning errors, just throws a couple of warnings.
while the above is possible, the value would be limited since any read in memory and serialisation back to tab would lose the 'unused' Sources.
trying to build a dummy assay without linking processes will fail to write the ISA assay table since the algorithm can not find paths (this is the expected behaviour, and also why no changes were made as it would require some significant changes (afaict).this would rule out your suggestion @cpommier

I'll document all that in a jupyter-notebook

proccaserra · 2020-12-09T14:14:19Z

@cpommier a suggestion: if brapi2isa code can identify 'unused Sources', then a possible solution would be to create "dummy" Samples so the ISA-Tab serialisation dumps the ISA.Source Information while clearly marking the ISA.Sample as "dummies". This could be done using a special string (e.g. <NOT_USED> string as value for the Sample Name), complemented by the addition of an ISA.Comment element qualifying said ISA.Sample object to make it clear to human agent.
Of course, this is 'sugar' and other tools supporting the format wouldn't know how to deal with that but it would possibly address your needs.

djcomlab · 2020-12-15T15:51:29Z

A question:

Are the source->sample, and sample->data workflows basically "flat" in all cases here?

From what I can see in the code it looks like there is no branching or pooling. The ISA API's ability to deal with complex graphs is what slows it down.

proccaserra · 2020-12-16T08:53:43Z

@djcomlab, based on the set of converted brapi studies, the graph complexity is very low. The amount of information in the ISA assay table is actually very low.

djcomlab · 2020-12-16T09:25:18Z

@proccaserra That's what I figured. I could probably put together a fast custom ISA-Tab writer just for this, until we address the overarching ISA-API performance. The idea being that ISA API is more like the Swiss Army knife that we can use to do many things with, but not necessarily all of these things optimally in all cases, but when we want to do one specific kind of task on a larger scale, we need to craft something optimised the task.

cpommier · 2021-01-11T08:37:58Z

@proccaserra @djcomlab , thanks! We had the same idea @erikkimmel and I and even tried an implementation, but it prooved to be too much work for us since we don't know the isatool library as well as you.
If you want to do a fast custom ISA-Tab writer we would be very happy to test it, either on the brapi2isa master branch or on our tentative optimisation branch (https://github.com/elixir-europe/plant-brapi-to-isa/tree/test/single_assay_by_level). In parralel, we could try to review that branch with @bedroesb . What do you think, should we have a chat before begining ?

Anyway, thanks a lot for proposing this optimised writer, it would be highly usefull!

erikkimmel · 2021-02-17T14:48:05Z

Good morning all,
With the code we modified with @cpommier on BrAPI2ISA in this branch, I ran some performance tests.
On a medium-sized dataset, I obtained a gain of around 25% (4h -> 3h).
Have you had time to look for the possible writing of a custom ISA dumper?
Thank you in advance for your answers.
Cheers

cpommier · 2021-03-05T11:11:23Z

We just reviewed the code and here is the current logic and todo list:

For each study there is one assay per level, ie datafile. For instance, one assay for plant level measures and one for micro plot level measures.
Check sufficient performances improvement with the above modifications.
Add the treatment columns in the datafile, this information is currently lost. As well as the coordinates.
Make a decision on the sample in the assay file. We need a pseudo sample for representing each of the levels. This might be a blocker.
Finalize the sample naming in the assay file.

erikkimmel · 2021-03-05T13:56:17Z

Hi,
Here are the command lines I used to run my tests. This could be usefull if you want to run them on your own side. I run each command one time on the BrAPI2ISA master branch and on test/single_assay_by_level branch the second time.

very small dataset

python3 ./plant-brapi-to-isa/brapi_to_isa.py -J -V -e https://urgi.versailles.inra.fr/faidare/brapi/v1/ -t dXJuOlVSR0kvdHJpYWwvMjM=

# on master
real	1m45.905s
user	1m38.270s
sys	0m0.927s

# on test/single_assay_by_level
real	1m30.787s
user	1m22.485s
sys	0m0.878s

bigger dataset

python3 ./plant-brapi-to-isa/brapi_to_isa.py -J -V -e https://urgi.versailles.inra.fr/faidare/brapi/v1/ -t aHR0cHM6Ly9kb2kub3JnLzEwLjE1NDU0L0lBU1NUTg==

# on master
real    276m49,662s                                                                                                                                                                                                                            user    267m20,711s                                                                                                                                                                                                                            sys     0m22,956s

# on test/single_assay_by_level
real    186m11,643s                                                                                                                                                                                                                            user    177m29,317s                                                                                                                                                                                                                            sys     0m7,336s

Regards

bedroesb · 2021-03-05T14:34:44Z

@erikkimmel which isa-tools version did you use?

erikkimmel · 2021-03-05T14:39:09Z

@bedroesb :

$ pip3 freeze | grep isatools
isatools==0.11.0

bedroesb · 2021-03-05T14:45:02Z

with isa-tools 12 i hope to finally solve #62 and to have improved isa json dumping, lets see! Do we after testing on the VIB dataset merge the test/single_assay_by_level branch into master ? or do you want to work on the other points first? @erikkimmel

cpommier · 2021-03-05T14:52:46Z

the changes proposed in test/single_assay_by_level could impove performances but have unwanted side effects. We should I think first check the impact of isatools 12, then asses the impact of this branch on metadata. It is still work in progress and there might be undesirable side effects. We can cope with it as long as there is a significative speed imporvment between master and test/single_assay_by_level. If that's not the case, the fast dumper suggested by @djcomlab might be more advisable.
Thanks for your interest and help/contribution :)

erikkimmel · 2021-03-08T14:44:18Z

@bedroesb, @cpommier I started to re-run my tests with the new 0.12.0 version of isatools. I'll keep you informed.

erikkimmel · 2021-03-16T14:15:14Z

@bedroesb, @cpommier here are the results of my tests using the 0.12.0 version of isatools:
I used the following command:

python3 ./plant-brapi-to-isa/brapi_to_isa.py -J -V -e https://urgi.versailles.inra.fr/faidare/brapi/v1/ -t aHR0cHM6Ly9kb2kub3JnLzEwLjE1NDU0L0lBU1NUTg==

on BrAPI2ISA master branch:

real    191m50,114s
user    183m22,339s
sys     0m7,436s

on test/single_assay_by_level branch:

real    143m33,114s
user    135m31,186s
sys     0m4,507s

Cheers

cpommier · 2021-03-16T14:35:09Z

Thanks Erik. isatools 0.12 is definitely an improvement. The single assay by level approach can take advantage of it, but 143m is still rather long. This is only a 3 year, 12 locations, 300 samples dataset. Extract it in a few minutes, without validation, would be really a great thing.
Considering the situation, reviving the fast writer hypothesis, even if it is not that easy to write, could be the most profitable approach. What do you think @proccaserra @djcomlab ?

erikkimmel · 2021-10-05T13:05:38Z

I ran the tests on the master branch of BrAPI2ISA with the 0.13-rc2 version of isatools:

time python3.8 ./plant-brapi-to-isa/brapi_to_isa.py -J -V -e https://urgi.versailles.inra.fr/faidare/brapi/v1/ -t aHR0cHM6Ly9kb2kub3JnLzEwLjE1NDU0L0lBU1NUTg==
[...]
real    8m33,605s
user    1m3,299s
sys     0m2,000s

bedroesb · 2021-10-05T13:10:25Z

this is such great news! Thanks for testing

cpommier · 2022-11-11T09:42:25Z

The one assay per level seems to be a bad approach. Performances improvement is still important, but probably with other approach, using latest isatools mainly.

cpommier assigned cpommier and erikkimmel Dec 2, 2020

cpommier mentioned this issue Dec 3, 2020

igraph optimisation for graphs processing to speedup ISA-Tab write ISA-tools/isa-api#365

Closed

cpommier linked a pull request Mar 5, 2021 that will close this issue

Improve performances by simplifying assay and sample handling #71

Draft

cpommier closed this as completed Nov 11, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Improve performances by having one assay by level #70

Improve performances by having one assay by level #70

cpommier commented Dec 2, 2020

cpommier commented Dec 2, 2020

bedroesb commented Dec 3, 2020

proccaserra commented Dec 7, 2020 •

edited

Loading

cpommier commented Dec 7, 2020 •

edited

Loading

proccaserra commented Dec 7, 2020

cpommier commented Dec 7, 2020

djcomlab commented Dec 7, 2020

cpommier commented Dec 8, 2020

djcomlab commented Dec 8, 2020

cpommier commented Dec 8, 2020

cpommier commented Dec 8, 2020

djcomlab commented Dec 8, 2020

cpommier commented Dec 8, 2020

proccaserra commented Dec 8, 2020

proccaserra commented Dec 9, 2020

djcomlab commented Dec 15, 2020

proccaserra commented Dec 16, 2020

djcomlab commented Dec 16, 2020 •

edited

Loading

cpommier commented Jan 11, 2021

erikkimmel commented Feb 17, 2021

cpommier commented Mar 5, 2021 •

edited by erikkimmel

Loading

erikkimmel commented Mar 5, 2021

bedroesb commented Mar 5, 2021

erikkimmel commented Mar 5, 2021

bedroesb commented Mar 5, 2021 •

edited

Loading

cpommier commented Mar 5, 2021

erikkimmel commented Mar 8, 2021

erikkimmel commented Mar 16, 2021 •

edited

Loading

cpommier commented Mar 16, 2021

erikkimmel commented Oct 5, 2021

bedroesb commented Oct 5, 2021

cpommier commented Nov 11, 2022

Improve performances by having one assay by level #70

Improve performances by having one assay by level #70

Comments

cpommier commented Dec 2, 2020

cpommier commented Dec 2, 2020

bedroesb commented Dec 3, 2020

proccaserra commented Dec 7, 2020 • edited Loading

cpommier commented Dec 7, 2020 • edited Loading

proccaserra commented Dec 7, 2020

cpommier commented Dec 7, 2020

djcomlab commented Dec 7, 2020

cpommier commented Dec 8, 2020

djcomlab commented Dec 8, 2020

cpommier commented Dec 8, 2020

cpommier commented Dec 8, 2020

djcomlab commented Dec 8, 2020

cpommier commented Dec 8, 2020

proccaserra commented Dec 8, 2020

proccaserra commented Dec 9, 2020

djcomlab commented Dec 15, 2020

proccaserra commented Dec 16, 2020

djcomlab commented Dec 16, 2020 • edited Loading

cpommier commented Jan 11, 2021

erikkimmel commented Feb 17, 2021

cpommier commented Mar 5, 2021 • edited by erikkimmel Loading

erikkimmel commented Mar 5, 2021

bedroesb commented Mar 5, 2021

erikkimmel commented Mar 5, 2021

bedroesb commented Mar 5, 2021 • edited Loading

cpommier commented Mar 5, 2021

erikkimmel commented Mar 8, 2021

erikkimmel commented Mar 16, 2021 • edited Loading

cpommier commented Mar 16, 2021

erikkimmel commented Oct 5, 2021

bedroesb commented Oct 5, 2021

cpommier commented Nov 11, 2022

proccaserra commented Dec 7, 2020 •

edited

Loading

cpommier commented Dec 7, 2020 •

edited

Loading

djcomlab commented Dec 16, 2020 •

edited

Loading

cpommier commented Mar 5, 2021 •

edited by erikkimmel

Loading

bedroesb commented Mar 5, 2021 •

edited

Loading

erikkimmel commented Mar 16, 2021 •

edited

Loading