
Centralized database for memory usage data #75

Open · natefoo opened this issue Nov 4, 2024 · 3 comments

natefoo (Member) commented Nov 4, 2024

Right now the memory and cores in the database are a bit arbitrary - luckily, we have all the data to reason about better mem values (and cores should probably typically be mem/4, as discussed in #60, except in cases where tools use little memory but see significant speedup with more cores).

If we all pushed our memory usage and input sizes to a centralized database, we could both visualize it (similar to how I have done it one-off in this gist) and hopefully automate some decisions about memory values in the shared DB.

However, there are some things for consideration by people who are good at statistics:

  • Cutoffs for high and low memory usage (or just use the 95th percentile?), since there are outliers
  • Cutoffs for high and low input sizes, since there is usually a lower bound on memory that does not correlate with input size at all
  • Input compression - the mixture of compressed and uncompressed data makes the ratio of input size to memory usage misleading
  • The current memory limit can arbitrarily cut off what would otherwise be valid successful jobs and thus skew the data, and this varies by server, although we do know what the limit was for each job
  • How input size affects memory usage - although it is rarely just input size; the actual content of the data matters too
  • How recent jobs need to be to be considered, since newer data is typically more useful than older data
  • Which tool versions to consider, since version can drastically affect memory usage, but we also don't necessarily want each +galaxyN version to be treated separately
sanjaysrikakulam (Collaborator) commented Nov 5, 2024

For example, we could use the data from the GRT once the project is resurrected. Last week, I tried to put together some thoughts on the project (please feel free to share your feedback/suggestions) so we could get a master's student working on it.

nuwang (Member) commented Nov 6, 2024

@natefoo Have you seen this PR? #64
It was a preliminary attempt at this. The biggest problem so far has been the inconsistency of the data in the federation (e.g. invalid cgroup metrics), but I believe some of these issues have since been fixed, so we may be able to make a fresh pass soon.
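Screening out the invalid cgroup metrics mentioned above could look something like this sketch - the field names and plausibility rules are hypothetical, not the actual federation schema:

```python
# Hypothetical sanity filter for federated job metric records.
def is_plausible(job):
    peak = job.get("memory_peak_bytes")
    limit = job.get("memory_limit_bytes")
    if peak is None or limit is None:
        return False        # metric collection failed entirely
    if peak <= 0:
        return False        # broken cgroup collection often reports 0
    if peak > limit:
        return False        # peak can't exceed the cgroup limit; bad metric
    return True

jobs = [
    {"memory_peak_bytes": 2_000_000_000, "memory_limit_bytes": 8_000_000_000},
    {"memory_peak_bytes": 0, "memory_limit_bytes": 8_000_000_000},
    {"memory_peak_bytes": 9_000_000_000, "memory_limit_bytes": 8_000_000_000},
]
clean = [j for j in jobs if is_plausible(j)]
```

Keeping a per-server count of rejected records would also make it easy to spot which federation members still have broken collection.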

natefoo (Member, Author) commented Nov 6, 2024

Thanks for the heads up! I missed that.
