
New thread interface #70

Open
eleon opened this issue Jan 9, 2024 · 40 comments

Comments

@eleon
Member

eleon commented Jan 9, 2024

Fine-grained affinity for threads will use abstractions similar to those used for processes in QV.

Here are a few examples for OpenMP and POSIX threads. These assume the process already has a NUMA scope defined.

OpenMP example 1: Launch one thread per core.

/* Use defaults for thread color and num. threads */
int i;
int *th_color = NULL;
int th_size;
qv_scope_t **th_scope;

/* We want as many threads as cores */
qv_scope_th_split_at(ctx, numa_scope, QV_HW_OBJ_CORE, th_color, &th_scope, &th_size);

omp_set_num_threads(th_size);
#pragma omp parallel
{
  qv_bind_push(ctx, th_scope[omp_get_thread_num()]);

  /* Each thread works */
}

/* When we are done with all parallel regions using this layout */
for (i = 0; i < th_size; i++)
  free(th_scope[i]);

OpenMP example 2: Launch two threads per core.

/* We want 2 threads per core */
int i, ncores, nchunks, th_size;
int *th_color;
qv_scope_t **th_scope;

qv_scope_nobjs(ctx, numa_scope, QV_HW_OBJ_CORE, &ncores);
th_size = 2 * ncores;
nchunks = ncores;

th_color = malloc(th_size * sizeof(int));
for (i = 0; i < th_size; i++)
  th_color[i] = i % ncores;

qv_scope_th_split(ctx, numa_scope, nchunks, th_color, &th_scope, &th_size);

omp_set_num_threads(th_size);
#pragma omp parallel
{
  qv_bind_push(ctx, th_scope[omp_get_thread_num()]);

  /* Each thread works */
}

/* When we are done with all parallel regions using this layout */
for (i = 0; i < th_size; i++)
  free(th_scope[i]);
free(th_color);

POSIX threads example:

void *thread_work(void *arg)
{
  /* Do the work */
  return NULL;
}

int main(int argc, char *argv[])
{
  int i;
  void *ret;

  /* Get my numa_scope here */

  /* Use defaults for thread color and num. threads */
  int *th_color = NULL;
  int th_size;
  qv_scope_t **th_scope;

  /* We want as many pthreads as hardware threads */
  qv_scope_th_split_at(ctx, numa_scope, QV_HW_OBJ_PU, th_color, &th_scope, &th_size);

  pthread_t *thid = malloc(th_size * sizeof(pthread_t));
  pthread_attr_t *attr = NULL;
  void *args = NULL;

  for (i = 0; i < th_size; i++)
    qv_pthread_create(&thid[i], attr, thread_work, args, ctx, th_scope[i]);

  for (i = 0; i < th_size; i++)
    pthread_join(thid[i], &ret);

  /* Can launch more parallel regions here before releasing the scopes */

  for (i = 0; i < th_size; i++)
    free(th_scope[i]);

  free(thid);
  return 0;
}

This issue is related to issues #68 and #69. We want a single qv_scope_split and a single qv_scope_split_at, i.e., no special qv_scope_th_split.

An actual Pthreads example can be found here: https://github.com/hpc/quo-vadis/blob/master/tests/test-progress-thread.c

@GuillaumeMercier
Collaborator

What is proposed here simplifies the original design while avoiding the coexistence of two different paradigms (original QV + layouts). I think that the current examples will need some marginal modifications in both the Pthreads and OpenMP cases. One potential issue is that qv_scope_get only accepts intrinsic scopes as an argument at the moment, and we might want to relax this. See my comment here.

@eleon
Member Author

eleon commented Jan 31, 2024

Perhaps we can discuss this in the next meeting?

I don't follow why qv_scope_get is needed in the example (I also need to understand the need for qv_thread_context_create). If a scope is already defined, e.g., mpi_numa_scope, why not use it directly?

@GuillaumeMercier
Collaborator

> Perhaps we can discuss this in the next meeting?

Yes, I think so.

> I don't follow why qv_scope_get is needed in the example (I also need to understand the need for qv_thread_context_create). If a scope is already defined, e.g., mpi_numa_scope, why not use it directly?

Please bear in mind that this example was written with the old "way" in mind.
qv_thread_context_create was introduced to cache information and possibly reuse it later instead of recomputing it for each parallel section. IIRC, I needed a scope for each thread, hence the use of qv_scope_get. But maybe my example was too convoluted and could be simplified. Anyway, should all threads use the same context (as you do in your example)? Is it an issue?

@eleon
Member Author

eleon commented Feb 1, 2024

Thanks for clarifying, @GuillaumeMercier.
Yes, once we make more progress on the new thread interface, it would be nice to simplify the example :)
I can't think of any issues with all threads using the same context.

@eleon
Member Author

eleon commented Feb 20, 2024

New names should be:

qv_thread_scope_split(...)
qv_thread_scope_split_at(...)

@GuillaumeMercier
Collaborator

I'm trying to code both routines and I've got a few questions:

  1. Why is th_size passed as a pointer? In which cases are split or split_at going to modify its value?
  2. No resource type is passed as an argument to split. Am I expected to retrieve this info from the scope? If yes, how do I achieve this, since in the regular (i.e., non-thread) version of split a type is always passed as an argument?

@eleon
Member Author

eleon commented Mar 8, 2024

@GuillaumeMercier , this was the first draft of the interface. The latest one is on Issues #68 and #69 (when we were discussing merging the process and thread interfaces). Please see below and let me know if you still have questions :)

@eleon
Member Author

eleon commented Mar 8, 2024

int
qv_thread_scope_split(
    qv_context_t *ctx,
    qv_scope_t *scope,
    int npieces, 
    int *group_id,
    int nthreads, 
    qv_scope_t ***subscope
);

@eleon
Member Author

eleon commented Mar 8, 2024

int
qv_thread_scope_split_at(
    qv_context_t *ctx,
    qv_scope_t *scope,
    qv_hw_obj_type_t type,
    int *group_id,
    int nthreads, 
    qv_scope_t ***subscope
);
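For concreteness, here is a minimal usage sketch against this draft signature (one thread per core, mirroring OpenMP example 1 above). The assumption that a NULL group_id means "use defaults" follows the earlier examples, not a settled API:

/* Sketch: one thread per core using the draft signature above.
 * Assumes ctx and numa_scope exist as in the earlier examples. */
int ncores;
qv_scope_t **th_scope;

qv_scope_nobjs(ctx, numa_scope, QV_HW_OBJ_CORE, &ncores);

/* NULL group_id: default placement, one subscope per thread */
qv_thread_scope_split_at(ctx, numa_scope, QV_HW_OBJ_CORE, NULL, ncores, &th_scope);

omp_set_num_threads(ncores);
#pragma omp parallel
{
  qv_bind_push(ctx, th_scope[omp_get_thread_num()]);
  /* Each thread works */
}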

@GuillaumeMercier
Collaborator

@eleon: Ok, thanks. My second point is still valid, though.

@eleon
Member Author

eleon commented Mar 8, 2024

Indeed, @GuillaumeMercier. For now, I'd suggest starting with qv_thread_scope_split_at, which does have a type.

For qv_thread_scope_split, users do not have to think in terms of resource types, but in terms of how many pieces to split the input scope into. One could potentially implement qv_thread_scope_split on top of qv_thread_scope_split_at. Happy to talk more about this in the next call.

@GuillaumeMercier
Collaborator

Yes, that's exactly my plan. But if you want to implement split on top of split_at, you need to retrieve the type you're currently "working on" somehow, hence my second point.

@eleon
Member Author

eleon commented Mar 10, 2024

Understood, @GuillaumeMercier. One possible way to implement split on top of split_at is to start with the NUMA type if there are enough nodes to satisfy the number of pieces. Otherwise, pick the next level down the memory hierarchy (e.g., L3) and so on, until the L1 cache/core level. We can discuss in detail on the next call :)

@GuillaumeMercier
Collaborator

Hmm. Not sure about that. To me, the logical thing would be to split from the current level of the context. Otherwise, what is the point of having it as a parameter? Or am I missing something?

@eleon
Member Author

eleon commented Mar 10, 2024

That makes sense, @GuillaumeMercier. Then, I would say start from the current level and go down the hierarchy until you match the number of pieces or reach the lowest level (PUs).
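A sketch of that descent, as a hypothetical helper (the level list and the QV_HW_OBJ_ names other than CORE/PU are assumptions; how to obtain the starting index is exactly the question raised next):

/* Hypothetical sketch: walk down from a given starting level until a
 * level has at least npieces objects (or we hit PUs), then split there.
 * The level list below is an assumption, not confirmed API. */
static const qv_hw_obj_type_t levels[] = {
  QV_HW_OBJ_NUMANODE, QV_HW_OBJ_L3CACHE, QV_HW_OBJ_L2CACHE,
  QV_HW_OBJ_L1CACHE, QV_HW_OBJ_CORE, QV_HW_OBJ_PU
};
static const int nlevels = sizeof(levels) / sizeof(levels[0]);

int
thread_scope_split_sketch(qv_context_t *ctx, qv_scope_t *scope,
                          int start, /* index of the scope's current level */
                          int npieces, int *group_id, int nthreads,
                          qv_scope_t ***subscope)
{
  for (int i = start; i < nlevels; i++) {
    int nobjs;
    qv_scope_nobjs(ctx, scope, levels[i], &nobjs);
    if (nobjs >= npieces || levels[i] == QV_HW_OBJ_PU)
      return qv_thread_scope_split_at(ctx, scope, levels[i],
                                      group_id, nthreads, subscope);
  }
  return 1; /* unreachable: the PU level always matches */
}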

@GuillaumeMercier
Collaborator

Yes, but how do you get the current level? That's my whole point. I don't see a way to do this currently. @samuelkgutierrez: do you have an idea?

@samuelkgutierrez
Member

samuelkgutierrez commented Mar 11, 2024

Would something like this work, @GuillaumeMercier?

Get the current level by first getting the input scope's cpuset: qvi_hwpool_cpuset_get(scope->hwpool) (example from https://github.com/hpc/quo-vadis/blob/master/src/qvi-scope.cc#L470-L476).

Then operate on the cpuset as necessary.

If this doesn't work for your use case, then we should have a talk offline.
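For illustration, a minimal sketch of "operate on the cpuset" in plain hwloc; level_from_cpuset is a hypothetical helper, not QV API:

#include <hwloc.h>

/* Hypothetical helper: infer a level from a cpuset. The first largest
 * object fully contained in the set gives a candidate type; note that
 * an irregular cpuset may not be covered by any single object. */
hwloc_obj_type_t
level_from_cpuset(hwloc_topology_t topo, hwloc_const_cpuset_t set)
{
  hwloc_obj_t obj = hwloc_get_first_largest_obj_inside_cpuset(topo, set);
  return obj ? obj->type : HWLOC_OBJ_PU;
}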

@GuillaumeMercier
Collaborator

I'll take a look and let you know. A question, though: would it be relevant to store the type information in the scope structure directly?

@samuelkgutierrez
Member

Can you expand on what you mean by 'type information?'

@GuillaumeMercier
Collaborator

Sure. The scope is "associated" (for lack of a better word) with some resources, which have a type (QV_HW_OBJ_XXX of type qv_hw_obj_type_t). So, would it be relevant to have a QV_HW_OBJ_XXX for a scope?

@samuelkgutierrez
Member

Oh, I see.

A scope is associated with a hardware pool that manages the low-level details of the types of resources that a scope manages. In that way, a scope already has that type information associated with it. More directly to your question: the cpuset associated with a hardware pool more generally describes the CPU resources tied to that scope. We have functions that allow us to query the types of resources covered by that cpuset, but there may not be a perfect one-to-one correspondence, since the cpuset could be irregular, potentially covering multiple simple types that we expose to a user.
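For example, with plain hwloc one can count how many objects of each type fall inside a cpuset; report_coverage below is a hypothetical sketch, not QV API:

#include <hwloc.h>
#include <stdio.h>

/* Hypothetical sketch: an irregular cpuset may cover, say, three cores
 * on one package and one core on another, so several types partially
 * describe it and no single type matches it exactly. */
void
report_coverage(hwloc_topology_t topo, hwloc_const_cpuset_t set)
{
  const hwloc_obj_type_t types[] =
    { HWLOC_OBJ_PACKAGE, HWLOC_OBJ_CORE, HWLOC_OBJ_PU };
  for (size_t i = 0; i < sizeof(types) / sizeof(types[0]); i++) {
    int n = hwloc_get_nbobjs_inside_cpuset_by_type(topo, set, types[i]);
    printf("%-8s objects inside cpuset: %d\n",
           hwloc_obj_type_string(types[i]), n);
  }
}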

If you're interested, let's just have a call; send me an email. It will be much quicker, I think.

@GuillaumeMercier
Collaborator

I can't make a call today. But as I wrote, I'll try with the information you gave me (very helpful). Once I've tried to code something, it will be more interesting to talk about this point again. As for irregular cpuset management, I recently coded something in my Hsplit library to support an MPI 4.1 feature; it could be applicable here too.

@samuelkgutierrez
Member

Sounds like a plan, @GuillaumeMercier.

@samuelkgutierrez
Member

@GuillaumeMercier qvi_global_split() is a nice entry point that shows the different splitting and info functions called from a different context.

@GuillaumeMercier
Collaborator

GuillaumeMercier commented Mar 13, 2024

> @GuillaumeMercier qvi_global_split() is a nice entry point that shows the different splitting and info functions called from a different context.

What about global_split_user_defined?
Also, a pointer to an object of type qvi_hwloc_t can be obtained through either a context or a scope. What is the difference (if any)?

@samuelkgutierrez
Member

global_split_user_defined() implements splitting of resources based on a user-defined coloring (e.g., from qv_scope_split()). There are other examples, like global_split_affinity_preserving(), which automatically splits resources based on other criteria. The eventual goal is for you to be able to use this infrastructure for your mapping. It might work now, but I couldn't say for sure; it may need some work.

To your second question: both should point to the same qvi_hwloc_t instance.

@GuillaumeMercier
Collaborator

I saw this comment in qvi_global_split:

// If all the values are negative and equal, then auto split. If not, then
// we were called with an invalid request. 

Is it really invalid? Maybe I'm biased by MPI, but a process might not want to be part of a split, and would give MPI_UNDEFINED (usually coded as a negative value) as its color. You then have a legitimate mix of positive and negative color values. Do we forbid this kind of behaviour in QV?
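For reference, a minimal example of the MPI semantics in question (standard MPI, not current QV behaviour):

#include <mpi.h>

/* Ranks passing MPI_UNDEFINED opt out of the split and get
 * MPI_COMM_NULL back; the remaining ranks are grouped by color. */
int main(int argc, char *argv[])
{
  int rank, color;
  MPI_Comm newcomm;

  MPI_Init(&argc, &argv);
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);

  /* Odd ranks opt out of the split entirely. */
  color = (rank % 2 == 0) ? 0 : MPI_UNDEFINED;
  MPI_Comm_split(MPI_COMM_WORLD, color, rank, &newcomm);

  if (newcomm != MPI_COMM_NULL) {
    /* Even ranks share a new communicator here. */
    MPI_Comm_free(&newcomm);
  }
  MPI_Finalize();
  return 0;
}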

@samuelkgutierrez
Member

As of today, the comment is correct because we don't implement those semantics just yet.

@GuillaumeMercier
Collaborator

I had my epiphany today about this split operation:

  1. Thread split and regular split are different operations: one is local while the other is collective (global). Therefore, having two different functions makes total sense.
  2. Consequently, I don't think that looking at the other (collective) split operations for an idea of how to implement the thread version is the way to go. For instance, instead of using split_hardware_resources, I was thinking of looking at qvi_hwloc_split_cpuset_by_color, used in global_split_user_defined.

@samuelkgutierrez, what is your take on this?

@samuelkgutierrez
Member

samuelkgutierrez commented Mar 20, 2024

You are correct that split_hardware_resources() is collective, but that function should work for both local and distributed calls because of how the group code is designed. For example, the process interface is able to use the same infrastructure as the MPI interface because of this design.

I would suggest first seeing if the qvi_scope_split() and qvi_scope_split_at() interfaces will work for you. I think they should. If those don't seem to work, digging deeper into the interfaces might be okay initially (to get something implemented). However, the top-level calls should be expressive enough to handle your case as well. If not, we should probably fix them.

@samuelkgutierrez
Member

I just realized something, @GuillaumeMercier. qvi_scope_split() et al. are limited to returning a single child, so you'll likely have to go another route for now because of this limitation.

@GuillaumeMercier
Collaborator

GuillaumeMercier commented Mar 20, 2024

Yes, I know, but I was thinking about calling the routine num_colors times (which is a very naïve way of doing things).
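A sketch of that naïve loop (it assumes the public qv_scope_split signature with a single child output parameter, which may not match the internal qvi_scope_split exactly; split_all_colors is hypothetical):

#include <stdlib.h>

/* Hypothetical helper: derive all children by splitting once per color. */
qv_scope_t **
split_all_colors(qv_context_t *ctx, qv_scope_t *scope, int num_colors)
{
  qv_scope_t **children = malloc(num_colors * sizeof(*children));
  for (int c = 0; c < num_colors; c++)
    qv_scope_split(ctx, scope, num_colors, c, &children[c]);
  return children;
}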

@samuelkgutierrez
Member

Ah, that might work! I hope it does.

@samuelkgutierrez
Member

samuelkgutierrez commented Mar 20, 2024

cmake -DCMAKE_BUILD_TYPE=Debug should enable debug symbols. I find that using ccmake's TUI is nice to fine-tune the configuration, but it takes a little getting used to.

@GuillaumeMercier
Collaborator

Aah. I was looking for this: ccmake.

@samuelkgutierrez
Member

make VERBOSE=1 is also nice for debugging the build process.

@GuillaumeMercier
Collaborator

GuillaumeMercier commented Apr 8, 2024

> I would suggest first seeing if the qvi_scope_split() and qvi_scope_split_at() interfaces will work for you. I think they should. If those don't seem to work, digging deeper into the interfaces might be okay initially (to get something implemented). However, the top-level calls should be expressive enough to handle your case as well. If not, we should probably fix them.

The current issue with qvi_scope_split() and qvi_scope_split_at() is that they call split_hardware_resources, which fails when a color value is larger than the number of requested splits. This limit should be lifted: for instance, I should be able to use colors 123 and 456 and ask for two splits. Also, I don't understand why qvi_scope_split needs a number of colors as an argument: this number should be determined by the color arguments provided by the calling processes. Since this is a collective call, this piece of information could be determined and exchanged. Or maybe I am missing something? I admit that I'm heavily biased by MPI_Comm_split and MPI_Comm_split_type here.
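The color handling asked for here amounts to dense ranking of arbitrary color values; a self-contained sketch in plain C (normalize_colors is hypothetical):

#include <stdlib.h>
#include <string.h>

/* Hypothetical sketch: map arbitrary colors (e.g., 123 and 456) to dense
 * group indices 0..k-1, so the number of splits k falls out of the
 * colors themselves instead of being passed separately. */
static int cmp_int(const void *a, const void *b)
{
  int x = *(const int *)a, y = *(const int *)b;
  return (x > y) - (x < y);
}

int
normalize_colors(const int *colors, int n, int *group_id /* out, size n */)
{
  int *sorted = malloc(n * sizeof(int));
  memcpy(sorted, colors, n * sizeof(int));
  qsort(sorted, n, sizeof(int), cmp_int);

  /* Deduplicate in place: sorted[0..k-1] holds the distinct colors. */
  int k = 0;
  for (int i = 0; i < n; i++) {
    int v = sorted[i];
    if (k == 0 || v != sorted[k - 1])
      sorted[k++] = v;
  }

  /* Each color's rank among the distinct values is its group id. */
  for (int i = 0; i < n; i++)
    for (int j = 0; j < k; j++)
      if (colors[i] == sorted[j]) { group_id[i] = j; break; }

  free(sorted);
  return k; /* colors {123, 456} -> groups {0, 1}, k == 2 */
}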

@samuelkgutierrez
Member

All good points, @GuillaumeMercier. Let's add these issues to our meeting's agenda.

@GuillaumeMercier
Collaborator

@samuelkgutierrez: I put this comment so that we can think about it before the meeting.

@GuillaumeMercier
Collaborator

And yes, I'm stuck with the implementation ...
