
Centralized database for memory usage data #75

Open · natefoo opened this issue Nov 4, 2024 · 3 comments

natefoo (Member) commented Nov 4, 2024

Right now the memory and cores in the database are a bit arbitrary - luckily, we have all the data to reason about better mem values (and cores should probably typically be mem/4, as discussed in #60, except in cases where tools use little memory but see significant speedup with more cores).

If we all pushed our memory usage and input sizes to a centralized database, we could both visualize it (similar to how I have done it one-off in this gist) and hopefully automate some decisions about memory values in the shared DB.

However, there are some things for consideration by people who are good at statistics:

  • Cutoffs for high and low memory usage (or just use the 95th percentile?), since there are outliers
  • Cutoffs for high and low input sizes, since there is usually a lower bound on memory that does not correlate with input size at all
  • Input compression - the mixture of compressed and uncompressed data makes the ratio of input size to memory usage misleading
  • The current memory limit can arbitrarily cut off what would otherwise be valid successful jobs and thus skew the data, and this varies by server, although we do know what the limit was for each job
  • How input size affects memory usage - although it is rarely just input size; the actual content of the data matters too
  • How recent jobs need to be to be considered, since newer data is typically more useful than older data
  • Which tool versions to consider, since version can drastically affect memory usage, but we also don't necessarily want each +galaxyN version to be treated separately
sanjaysrikakulam (Collaborator) commented Nov 5, 2024

For example, we could use the data from the GRT once the project is resurrected. Last week, I tried to put together some thoughts on the project (please feel free to share your feedback/suggestions) so we could get a master's student working on it.

nuwang (Member) commented Nov 6, 2024

@natefoo Have you seen this PR? #64
It was a preliminary attempt at this. The biggest problem so far has been the inconsistency of the data in the federation (e.g. invalid cgroup metrics), but I believe some of these issues have since been fixed, so we may be able to make a fresh pass soon.
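Screening out the invalid cgroup metrics mentioned above could look something like this sketch - the field names and plausibility rules are hypothetical, not the actual federation schema:

```python
# Hypothetical sanity filter for federated job metric records.
def is_plausible(job):
    peak = job.get("memory_peak_bytes")
    limit = job.get("memory_limit_bytes")
    if peak is None or limit is None:
        return False        # metric collection failed entirely
    if peak <= 0:
        return False        # broken cgroup collection often reports 0
    if peak > limit:
        return False        # peak can't exceed the cgroup limit; bad metric
    return True

jobs = [
    {"memory_peak_bytes": 2_000_000_000, "memory_limit_bytes": 8_000_000_000},
    {"memory_peak_bytes": 0, "memory_limit_bytes": 8_000_000_000},
    {"memory_peak_bytes": 9_000_000_000, "memory_limit_bytes": 8_000_000_000},
]
clean = [j for j in jobs if is_plausible(j)]
```

Keeping a per-server count of rejected records would also make it easy to spot which federation members still have broken collection.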

natefoo (Member, Author) commented Nov 6, 2024

Thanks for the heads up! I missed that.
