PAPI connector: per thread metrics #94

Open
keichi opened this issue Aug 4, 2020 · 5 comments
Labels
feature Needed feature but software still is correct on its own

Comments

@keichi
Contributor

keichi commented Aug 4, 2020

I noticed that the PAPI connector records the events only on the master thread. This is because PAPI performance counters are thread-local and PAPI_hl_region_begin/PAPI_hl_region_end are called from the master thread only.

It would be nice if the PAPI connector could record the performance counters on all threads. However, I'm guessing this needs changes in Kokkos itself to call the profiling hooks from inside the parallel regions.
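To illustrate why measuring from the master thread alone misses the workers, here is a minimal sketch that does not use PAPI at all. A `thread_local` variable named `fake_counter` stands in for a per-thread hardware counter, and `do_work` and `measure_from_master` are hypothetical names made up for this example.

```cpp
#include <cassert>
#include <cstdint>
#include <thread>
#include <vector>

// Stand-in for a hardware counter (assumption, not PAPI): every thread gets
// its own copy, mirroring the thread-local nature of the real counters.
thread_local std::uint64_t fake_counter = 0;

// Simulated work that ticks the calling thread's own counter.
void do_work(std::uint64_t n) {
  for (std::uint64_t i = 0; i < n; ++i) ++fake_counter;
}

// Runs num_workers extra threads doing n units each, but brackets the region
// from the calling (master) thread only. Returns the delta the master sees.
std::uint64_t measure_from_master(int num_workers, std::uint64_t n) {
  assert(num_workers >= 0);
  const std::uint64_t begin = fake_counter;  // analogous to a region begin
  std::vector<std::thread> workers;
  for (int t = 0; t < num_workers; ++t) workers.emplace_back(do_work, n);
  do_work(n);  // the master takes a share of the work too
  for (auto& w : workers) w.join();
  return fake_counter - begin;  // analogous to a region end
}
```

Calling `measure_from_master(3, 1000)` performs 4000 units of work in total, yet the returned delta only covers the master's own 1000 units, because the workers incremented their own copies of the counter.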

@DavidPoliakoff
Contributor

@keichi, yeah, all the tricks I can think of would require some pretty invasive changes to Kokkos to make this work. As a workaround, you can of course use a sampling tool that has PAPI support, but I acknowledge that's an unsatisfying answer. I'm going to file an issue against the main Kokkos repo to talk about adding this support, but it is likely to be a major effort. This is a really interesting problem, thanks for bringing it to our attention.

@jrmadsen
Contributor

jrmadsen commented Aug 5, 2020

A scheme that could potentially work, but would take some effort: use gotcha to wrap pthread start, and use that wrapper to set an alarm that only gets handled on that thread. When that alarm is delivered, the handler does a PAPI read of the current counters and updates the values in a global location. When the master thread updates, it just reads the per-thread array, which holds the latest value from the last time the alarm was delivered on each thread.
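The data flow of that scheme can be sketched roughly as follows. This is only an illustration under several assumptions: there is no gotcha wrapper, no signal, and no PAPI call here. Instead, each worker calls a hypothetical `publish` function every `period` iterations as a stand-in for the alarm handler, and `read_all` plays the role of the master thread reading the per-thread array. All names and the fixed `kMaxThreads` capacity are made up for the example.

```cpp
#include <array>
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <thread>
#include <vector>

constexpr std::size_t kMaxThreads = 64;  // hypothetical fixed capacity

// Global location holding the latest value published by each thread.
std::array<std::atomic<std::uint64_t>, kMaxThreads> latest_counts{};

// What the alarm handler would do: store this thread's current reading.
void publish(std::size_t slot, std::uint64_t value) {
  assert(slot < kMaxThreads);
  latest_counts[slot].store(value, std::memory_order_relaxed);
}

// What the master would do: sum the latest published value of every thread.
std::uint64_t read_all(std::size_t num_threads) {
  assert(num_threads <= kMaxThreads);
  std::uint64_t total = 0;
  for (std::size_t s = 0; s < num_threads; ++s)
    total += latest_counts[s].load(std::memory_order_relaxed);
  return total;
}

// Worker: counts n events locally and publishes every `period` events,
// standing in for a periodic alarm. There is no final publish on purpose.
void worker(std::size_t slot, std::uint64_t n, std::uint64_t period) {
  std::uint64_t local = 0;
  for (std::uint64_t i = 0; i < n; ++i) {
    ++local;
    if (local % period == 0) publish(slot, local);
  }
}

// Resets the slots, runs the workers to completion, then reads as the master.
std::uint64_t run(std::size_t num_threads, std::uint64_t n,
                  std::uint64_t period) {
  assert(period > 0);
  for (std::size_t s = 0; s < num_threads; ++s) publish(s, 0);
  std::vector<std::thread> pool;
  for (std::size_t s = 0; s < num_threads; ++s)
    pool.emplace_back(worker, s, n, period);
  for (auto& t : pool) t.join();
  return read_all(num_threads);
}
```

One property this makes visible is staleness: the master only ever sees the value from the last publish, so events counted after the final alarm on a thread are not reflected until the next one fires.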

@jrmadsen
Contributor

jrmadsen commented Aug 5, 2020

Or actually, forget the gotcha. Kokkos could add a tool initialization call on each thread that it initializes.

@keichi
Contributor Author

keichi commented Aug 6, 2020

Thank you for looking into this.

> Kokkos could add a tool initialization call on each thread that it initializes.

To me, this approach seems relatively simple and noninvasive. However, all existing Kokkos profiling hooks are backend-agnostic, so might changing that be a problem?

@jrmadsen
Contributor

jrmadsen commented Aug 6, 2020

Well, we wouldn't call the hooks from CUDA threads. But calling the thread initialization routine from each thread in the pthread backend when we create the pool, and from each thread in the OpenMP CPU threading backend, shouldn't be invasive at all. It's also not an overhead issue: this would be a routine that only gets called once on a thread, after Kokkos initialize. Anything more would open us up to performance degradation, and using that alarm scheme would provide a relatively easy way to do it compared with how you'd have to do it otherwise (intercepting the thread creation function call).
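A rough sketch of what such a once-per-thread initialization call might look like is below. To be clear, this is not an existing Kokkos or Kokkos Tools API: `ThreadInitHook`, `create_pool`, and `tool_thread_init` are hypothetical names, and the pool is a plain `std::thread` stand-in for a CPU backend.

```cpp
#include <atomic>
#include <cassert>
#include <functional>
#include <thread>
#include <vector>

// Hypothetical hook signature (assumption, not an existing Kokkos callback).
using ThreadInitHook = std::function<void(int thread_id)>;

// Stand-in for a CPU backend creating its thread pool: every new thread
// invokes the hook exactly once, on itself, before it would wait for work.
void create_pool(int num_threads, const ThreadInitHook& hook) {
  assert(num_threads >= 0);
  std::vector<std::thread> pool;
  for (int t = 0; t < num_threads; ++t) {
    pool.emplace_back([t, &hook] {
      if (hook) hook(t);
      // A real pool thread would now wait for parallel work.
    });
  }
  for (auto& th : pool) th.join();
}

// Tool side: per-thread state plus a global count of initialized threads.
thread_local bool counters_ready = false;
std::atomic<int> initialized_threads{0};

// A real tool would set up its per-thread counter state here.
void tool_thread_init(int /*thread_id*/) {
  counters_ready = true;
  initialized_threads.fetch_add(1);
}
```

With `create_pool(4, tool_thread_init)`, the hook runs four times, once on each pool thread, and the calling thread's own `counters_ready` flag is left untouched. Since the call happens only at pool creation, it adds no cost to individual parallel regions.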

vlkale added the feature label on Apr 6, 2023