Make UCX/UCC required dependencies #184

Draft
wants to merge 2 commits into
base: main

dalcinl
Contributor

@dalcinl dalcinl commented Oct 24, 2024

Checklist

  • Used a personal fork of the feedstock to propose changes
  • Bumped the build number (if the version is unchanged)
  • Reset the build number to 0 (if the version changed)
  • Re-rendered with the latest conda-smithy (Use the phrase @conda-forge-admin, please rerender in a comment in this PR for automated rerendering)
  • Ensured the license file is being packaged.

xref #181
xref conda-forge/mpich-feedstock#104

@conda-forge-admin
Contributor

Hi! This is the friendly automated conda-forge-linting service.

I just wanted to let you know that I linted all conda-recipes in your PR (recipe/meta.yaml) and found it was in an excellent condition.

@dalcinl dalcinl marked this pull request as draft October 24, 2024 13:20
@@ -83,12 +82,11 @@ fi
--with-hwloc=$PREFIX \
--with-libevent=$PREFIX \
--with-zlib=$PREFIX \
--enable-mca-dso \
Contributor Author

@minrk It seems to work w.r.t. CUDA. Am I missing something?

Member

It looks like CUDA components are DSOs by default in Open MPI 5: open-mpi/ompi#12055 (also reflected in the build output, which shows that all the components that link libcuda are in DSOs).

However, the docs haven't been updated to reflect that: open-mpi/ompi#12911
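
One way to check which components ended up as DSOs in an installed environment is to list the component directory. The following is a minimal sketch, not part of this PR: it assumes the DSOs live in a directory such as /opt/conda/lib/openmpi (the mca_base_component_path default reported later in this thread) and follow the mca_<framework>_<component>.so naming, and that framework names contain no underscore. The function name list_dso_components is hypothetical.

```python
from collections import defaultdict
from pathlib import Path


def list_dso_components(component_dir):
    """Group mca_<framework>_<component>.so files by framework.

    Assumption: the framework name is the text between "mca_" and the
    next underscore; everything after that is the component name.
    """
    by_framework = defaultdict(list)
    for path in sorted(Path(component_dir).glob("mca_*.so")):
        # path.stem drops the ".so" suffix, e.g. "mca_accelerator_cuda"
        framework, _, component = path.stem[len("mca_"):].partition("_")
        if component:  # skip names that do not match the expected pattern
            by_framework[framework].append(component)
    return dict(by_framework)
```

Calling list_dso_components("/opt/conda/lib/openmpi") in the build or test environment would then show, per framework, which components were built as separate shared objects. It does not check what each DSO links against.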

@dalcinl
Contributor Author

dalcinl commented Oct 24, 2024

So far, I have addressed the build part of it. I have not changed defaults, that is, UCX/UCC are still disabled.
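
For readers who want to opt in while the defaults stay disabled, a per-run override through the environment is one option. This is a hedged sketch rather than the feedstock's documented procedure: it relies on Open MPI reading MCA parameters from OMPI_MCA_<name> environment variables, and it assumes that the relevant parameters are pml, osc, and coll_ucc_enable (only coll_ucc_enable is confirmed by the output in this thread). The helper name ucx_ucc_env is hypothetical.

```python
import os


def ucx_ucc_env(base_env=None):
    """Return a copy of the environment with UCX/UCC opted in.

    Assumption: selecting the "ucx" component for pml/osc and setting
    coll_ucc_enable=1 is what turns these features on.
    """
    env = dict(os.environ if base_env is None else base_env)
    env.update(
        {
            "OMPI_MCA_pml": "ucx",            # point-to-point layer
            "OMPI_MCA_osc": "ucx",            # one-sided communication
            "OMPI_MCA_coll_ucc_enable": "1",  # UCC collectives
        }
    )
    return env
```

The returned dictionary can be passed as the env argument of subprocess.run when launching mpiexec, which leaves openmpi-mca-params.conf untouched.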

@minrk minrk changed the title Update UCX/UCC support Make UCX/UCC required dependencies Nov 6, 2024
@minrk
Member

minrk commented Nov 6, 2024

At the very least, I think we need to update the messages for ucx/ucc support that state these packages need to be installed. I would also argue that, since they are packaged, we should remove the messages for ucx/ucc entirely, since there is nothing unusual there. Explaining them is the purview of documentation (i.e. GitHub issues here, I guess).

I also lean toward removing all of the changed defaults that disable ucx/ucc, since the main reason was that it would break without them. That is no longer true, but I recognize that possible changes to defaults not tied to a version can be disruptive. My inclination, though, is to let the package's defaults be as default as they can be. But if folks want to tie that to a version update, I'm okay with that as well. But is it possible to keep the defaults without disabling features and then remove the overrides on a version bump?

@minrk
Member

minrk commented Nov 6, 2024

Note that it looks like ucx/ucc are already not the default, so we should be able to remove those lines from openmpi-mca-params.conf without any effect. This is the output of mpiexec --allow-run-as-root --mca mpi_show_mca_params all -n 2 python3 -c "from mpi4py import MPI" (linux-aarch64) after removing the ucc/ucx lines from openmpi-mca-params.conf:

[1baf99fb0f7b:00209] base_help_aggregate=true (default)
[1baf99fb0f7b:00209] mca_base_param_files=/opt/conda/etc/openmpi-mca-params.conf (default)
[1baf99fb0f7b:00209] mca_param_files=/opt/conda/etc/openmpi-mca-params.conf (default)
[1baf99fb0f7b:00209] mca_base_override_param_file=/opt/conda/etc/openmpi-mca-params-override.conf (default)
[1baf99fb0f7b:00209] mca_base_suppress_override_warning=false (default)
[1baf99fb0f7b:00209] mca_base_param_file_prefix= (default)
[1baf99fb0f7b:00209] mca_base_envar_file_prefix= (default)
[1baf99fb0f7b:00209] mca_base_param_file_path=/opt/conda/share/openmpi/amca-param-sets:/io (default)
[1baf99fb0f7b:00209] mca_base_param_file_path_force= (default)
[1baf99fb0f7b:00209] opal_signal=6,7,8,11 (default)
[1baf99fb0f7b:00209] opal_stacktrace_output=stderr (default)
[1baf99fb0f7b:00209] opal_net_private_ipv4=10.0.0.0/8;172.16.0.0/12;192.168.0.0/16;169.254.0.0/16 (default)
[1baf99fb0f7b:00209] opal_set_max_sys_limits= (default)
[1baf99fb0f7b:00209] opal_var_dump_color=name=34,value=32,valid_values=36 (default)
[1baf99fb0f7b:00209] opal_built_with_cuda_support=true (default)
[1baf99fb0f7b:00209] opal_cuda_support=false (file (/opt/conda/etc/openmpi-mca-params.conf:3))
[1baf99fb0f7b:00209] opal_warn_on_missing_libcuda=false (file (/opt/conda/etc/openmpi-mca-params.conf:2))
[1baf99fb0f7b:00209] mpi_leave_pinned=auto (default)
[1baf99fb0f7b:00209] opal_leave_pinned=auto (default)
[1baf99fb0f7b:00209] mpi_leave_pinned_pipeline=false (default)
[1baf99fb0f7b:00209] opal_leave_pinned_pipeline=false (default)
[1baf99fb0f7b:00209] mpi_warn_on_fork=true (default)
[1baf99fb0f7b:00209] opal_abort_delay=0 (default)
[1baf99fb0f7b:00209] opal_abort_print_stack=false (default)
[1baf99fb0f7b:00209] mca_base_env_list= (default)
[1baf99fb0f7b:00209] mca_base_env_list_delimiter=; (default)
[1baf99fb0f7b:00209] opal_max_thread_in_progress=1 (default)
[1baf99fb0f7b:00209] mca_base_component_path=/opt/conda/lib/openmpi:/root/.openmpi/components (default)
[1baf99fb0f7b:00209] mca_component_path=/opt/conda/lib/openmpi:/root/.openmpi/components (default)
[1baf99fb0f7b:00209] mca_base_component_show_load_errors=0 (file (/opt/conda/etc/openmpi-mca-params.conf:1))
[1baf99fb0f7b:00209] mca_component_show_load_errors=0 (file (/opt/conda/etc/openmpi-mca-params.conf:1))
[1baf99fb0f7b:00209] mca_base_component_track_load_errors=false (default)
[1baf99fb0f7b:00209] mca_base_component_disable_dlopen=false (default)
[1baf99fb0f7b:00209] mca_component_disable_dlopen=false (default)
[1baf99fb0f7b:00209] mca_base_verbose=stderr (default)
[1baf99fb0f7b:00209] mca_verbose=stderr (default)
[1baf99fb0f7b:00209] dl= (default)
[1baf99fb0f7b:00209] dl_base_verbose=error (default)
[1baf99fb0f7b:00209] dl_dlopen_filename_suffixes=.so,.dylib,.dll,.sl (default)
[1baf99fb0f7b:00209] mpi_ft_enable=false (default)
[1baf99fb0f7b:00209] mpi_ft_verbose=65535 (default)
[1baf99fb0f7b:00209] mpi_ft_reliable_bcast=1 (default)
[1baf99fb0f7b:00209] mpi_ft_propagator_with_rbcast=false (default)
[1baf99fb0f7b:00209] mpi_ft_detector=false (default)
[1baf99fb0f7b:00209] mpi_ft_detector_thread=false (default)
[1baf99fb0f7b:00209] mpi_ft_detector_period=3.000000 (default)
[1baf99fb0f7b:00209] mpi_ft_detector_timeout=10.000000 (default)
[1baf99fb0f7b:00209] mpi_ft_detector_rdma_heartbeat=false (default)
[1baf99fb0f7b:00209] mpi_param_check=true (default)
[1baf99fb0f7b:00209] mpi_yield_when_idle=false (default)
[1baf99fb0f7b:00209] mpi_event_tick_rate=-1 (default)
[1baf99fb0f7b:00209] mpi_show_handle_leaks=false (default)
[1baf99fb0f7b:00209] mpi_no_free_handles=false (default)
[1baf99fb0f7b:00209] mpi_show_mpi_alloc_mem_leaks=0 (default)
[1baf99fb0f7b:00209] mpi_show_mca_params=all (environment)
[1baf99fb0f7b:00209] mpi_show_mca_params_file= (default)
[1baf99fb0f7b:00209] mpi_preconnect_all=false (default)
[1baf99fb0f7b:00209] mpi_have_sparse_group_storage=false (default)
[1baf99fb0f7b:00209] mpi_use_sparse_group_storage=false (default)
[1baf99fb0f7b:00209] mpi_cuda_support=false (file (/opt/conda/etc/openmpi-mca-params.conf:3))
[1baf99fb0f7b:00209] mpi_built_with_cuda_support=true (default)
[1baf99fb0f7b:00209] mpi_add_procs_cutoff=0 (default)
[1baf99fb0f7b:00209] mpi_dynamics_enabled=true (default)
[1baf99fb0f7b:00209] async_mpi_init=false (default)
[1baf99fb0f7b:00209] async_mpi_finalize=false (default)
[1baf99fb0f7b:00209] mpi_abort_delay=0 (default)
[1baf99fb0f7b:00209] mpi_abort_print_stack=false (default)
[1baf99fb0f7b:00209] mpi_compat_mpi3=true (default)
[1baf99fb0f7b:00209] mpi_pmix_connect_timeout=0 (default)
[1baf99fb0f7b:00209] ompi_timing=false (default)
[1baf99fb0f7b:00209] ompi_stream_buffering=-1 (default)
[1baf99fb0f7b:00209] if= (default)
[1baf99fb0f7b:00209] if_base_verbose=error (default)
[1baf99fb0f7b:00209] if_base_do_not_resolve=false (default)
[1baf99fb0f7b:00209] if_base_retain_loopback=false (default)
[1baf99fb0f7b:00209] threads= (default)
[1baf99fb0f7b:00209] threads_base_verbose=error (default)
[1baf99fb0f7b:00209] threads_pthreads_yield_strategy=sched_yield (default)
[1baf99fb0f7b:00209] threads_pthreads_nanosleep_time=1 (default)
[1baf99fb0f7b:00209] hwloc= (default)
[1baf99fb0f7b:00209] hwloc_base_verbose=error (default)
[1baf99fb0f7b:00209] hwloc_base_mem_bind_failure_action=warn (default)
[1baf99fb0f7b:00209] memcpy= (default)
[1baf99fb0f7b:00209] memcpy_base_verbose=error (default)
[1baf99fb0f7b:00209] memchecker= (default)
[1baf99fb0f7b:00209] memchecker_base_verbose=error (default)
[1baf99fb0f7b:00209] backtrace= (default)
[1baf99fb0f7b:00209] backtrace_base_verbose=error (default)
[1baf99fb0f7b:00209] timer= (default)
[1baf99fb0f7b:00209] timer_base_verbose=error (default)
[1baf99fb0f7b:00209] timer_require_monotonic=true (default)
[1baf99fb0f7b:00209] shmem= (default)
[1baf99fb0f7b:00209] shmem_base_verbose=error (default)
[1baf99fb0f7b:00209] shmem_mmap_priority=50 (default)
[1baf99fb0f7b:00209] shmem_mmap_enable_nfs_warning=true (default)
[1baf99fb0f7b:00209] shmem_mmap_relocate_backing_file=0 (default)
[1baf99fb0f7b:00209] shmem_mmap_backing_file_base_dir=/dev/shm (default)
[1baf99fb0f7b:00209] reachable= (default)
[1baf99fb0f7b:00209] reachable_base_verbose=error (default)
[1baf99fb0f7b:00209] pmix= (default)
[1baf99fb0f7b:00209] pmix_base_verbose=error (default)
[1baf99fb0f7b:00209] pmix_base_async_modex=false (default)
[1baf99fb0f7b:00209] pmix_base_collect_data=true (default)
[1baf99fb0f7b:00209] pmix_base_exchange_timeout=-1 (default)
[1baf99fb0f7b:00209] accelerator= (default)
[1baf99fb0f7b:00209] accelerator_base_verbose=error (default)
[1baf99fb0f7b:00209] opal_event_include=epoll (default)
[1baf99fb0f7b:00209] event_external_include=epoll (default)
[1baf99fb0f7b:00209] opal_event_verbose=error (default)
[1baf99fb0f7b:00209] event_base_verbose=error (default)
[1baf99fb0f7b:00209] hook= (default)
[1baf99fb0f7b:00209] hook_base_verbose=error (default)
[1baf99fb0f7b:00209] hook_comm_method_verbose=0 (default)
[1baf99fb0f7b:00209] hook_comm_method_display= (default)
[1baf99fb0f7b:00209] hook_comm_method_max=12 (default)
[1baf99fb0f7b:00209] hook_comm_method_brief=false (default)
[1baf99fb0f7b:00209] hook_comm_method_fakefile= (default)
[1baf99fb0f7b:00209] op= (default)
[1baf99fb0f7b:00209] op_base_verbose=error (default)
[1baf99fb0f7b:00209] op_aarch64_hardware_available=1 (default)
[1baf99fb0f7b:00209] op_aarch64_double_supported=false (default)
[1baf99fb0f7b:00209] allocator= (default)
[1baf99fb0f7b:00209] allocator_base_verbose=error (default)
[1baf99fb0f7b:00209] allocator_bucket_num_buckets=30 (default)
[1baf99fb0f7b:00209] rcache= (default)
[1baf99fb0f7b:00209] rcache_base_verbose=error (default)
[1baf99fb0f7b:00209] rcache_grdma_print_stats=false (default)
[1baf99fb0f7b:00209] mpool= (default)
[1baf99fb0f7b:00209] mpool_base_verbose=error (default)
[1baf99fb0f7b:00209] mpool_hugepage_priority=50 (default)
[1baf99fb0f7b:00209] mpool_hugepage_page_size=2097152 (default)
[1baf99fb0f7b:00209] smsc= (default)
[1baf99fb0f7b:00209] smsc_base_verbose=error (default)
[1baf99fb0f7b:00209] smsc_cma_priority=37 (default)
[1baf99fb0f7b:00209] bml= (default)
[1baf99fb0f7b:00209] bml_base_verbose=error (default)
[1baf99fb0f7b:00209] bml_r2_show_unreach_errors=true (default)
[1baf99fb0f7b:00209] btl= (default)
[1baf99fb0f7b:00209] btl_base_verbose=error (default)
[1baf99fb0f7b:00209] btl_base_include= (default)
[1baf99fb0f7b:00209] btl_base_exclude= (default)
[1baf99fb0f7b:00209] btl_base_warn_peer_error=true (default)
[1baf99fb0f7b:00209] btl_base_warn_component_unused=1 (default)
[1baf99fb0f7b:00209] btl_sm_free_list_num=8 (default)
[1baf99fb0f7b:00209] btl_vader_free_list_num=8 (default)
[1baf99fb0f7b:00209] btl_sm_free_list_max=512 (default)
[1baf99fb0f7b:00209] btl_vader_free_list_max=512 (default)
[1baf99fb0f7b:00209] btl_sm_free_list_inc=64 (default)
[1baf99fb0f7b:00209] btl_vader_free_list_inc=64 (default)
[1baf99fb0f7b:00209] btl_sm_memcpy_limit=524288 (default)
[1baf99fb0f7b:00209] btl_vader_memcpy_limit=524288 (default)
[1baf99fb0f7b:00209] btl_sm_segment_size=16777216 (default)
[1baf99fb0f7b:00209] btl_vader_segment_size=16777216 (default)
[1baf99fb0f7b:00209] btl_sm_max_inline_send=256 (default)
[1baf99fb0f7b:00209] btl_vader_max_inline_send=256 (default)
[1baf99fb0f7b:00209] btl_sm_fbox_threshold=16 (default)
[1baf99fb0f7b:00209] btl_vader_fbox_threshold=16 (default)
[1baf99fb0f7b:00209] btl_sm_fbox_max=32 (default)
[1baf99fb0f7b:00209] btl_vader_fbox_max=32 (default)
[1baf99fb0f7b:00209] btl_sm_fbox_size=4096 (default)
[1baf99fb0f7b:00209] btl_vader_fbox_size=4096 (default)
[1baf99fb0f7b:00209] btl_sm_backing_directory=/dev/shm (default)
[1baf99fb0f7b:00209] btl_vader_backing_directory=/dev/shm (default)
[1baf99fb0f7b:00209] btl_sm_exclusivity=65536 (default)
[1baf99fb0f7b:00209] btl_vader_exclusivity=65536 (default)
[1baf99fb0f7b:00209] btl_sm_flags=send,put,get,inplace (default)
[1baf99fb0f7b:00209] btl_vader_flags=send,put,get,inplace (default)
[1baf99fb0f7b:00209] btl_sm_atomic_flags= (default)
[1baf99fb0f7b:00209] btl_vader_atomic_flags= (default)
[1baf99fb0f7b:00209] btl_sm_rndv_eager_limit=32768 (default)
[1baf99fb0f7b:00209] btl_vader_rndv_eager_limit=32768 (default)
[1baf99fb0f7b:00209] btl_sm_eager_limit=4096 (default)
[1baf99fb0f7b:00209] btl_vader_eager_limit=4096 (default)
[1baf99fb0f7b:00209] btl_sm_accelerator_eager_limit=0 (default)
[1baf99fb0f7b:00209] btl_vader_accelerator_eager_limit=0 (default)
[1baf99fb0f7b:00209] btl_sm_accelerator_rdma_limit=18446744073709551615 (default)
[1baf99fb0f7b:00209] btl_vader_accelerator_rdma_limit=18446744073709551615 (default)
[1baf99fb0f7b:00209] btl_sm_accelerator_max_send_size=0 (default)
[1baf99fb0f7b:00209] btl_vader_accelerator_max_send_size=0 (default)
[1baf99fb0f7b:00209] btl_sm_max_send_size=32768 (default)
[1baf99fb0f7b:00209] btl_vader_max_send_size=32768 (default)
[1baf99fb0f7b:00209] btl_vader_major_version=5 (default)
[1baf99fb0f7b:00209] btl_vader_minor_version=0 (default)
[1baf99fb0f7b:00209] btl_vader_release_version=5 (default)
[1baf99fb0f7b:00209] btl_self_free_list_num=0 (default)
[1baf99fb0f7b:00209] btl_self_free_list_max=64 (default)
[1baf99fb0f7b:00209] btl_self_free_list_inc=8 (default)
[1baf99fb0f7b:00209] btl_self_exclusivity=65536 (default)
[1baf99fb0f7b:00209] btl_self_atomic_flags= (default)
[1baf99fb0f7b:00209] btl_self_rndv_eager_limit=131072 (default)
[1baf99fb0f7b:00209] btl_self_eager_limit=1024 (default)
[1baf99fb0f7b:00209] btl_self_get_limit=18446744073709551615 (default)
[1baf99fb0f7b:00209] btl_self_get_alignment=0 (default)
[1baf99fb0f7b:00209] btl_self_put_limit=18446744073709551615 (default)
[1baf99fb0f7b:00209] btl_self_put_alignment=0 (default)
[1baf99fb0f7b:00209] btl_self_accelerator_eager_limit=0 (default)
[1baf99fb0f7b:00209] btl_self_accelerator_rdma_limit=18446744073709551615 (default)
[1baf99fb0f7b:00209] btl_self_accelerator_max_send_size=0 (default)
[1baf99fb0f7b:00209] btl_self_max_send_size=16384 (default)
[1baf99fb0f7b:00209] btl_self_rdma_pipeline_send_length=2147483647 (default)
[1baf99fb0f7b:00209] btl_self_rdma_pipeline_frag_size=2147483647 (default)
[1baf99fb0f7b:00209] btl_self_min_rdma_pipeline_size=2147484671 (default)
[1baf99fb0f7b:00209] btl_self_latency=0 (default)
[1baf99fb0f7b:00209] btl_self_bandwidth=100 (default)
[1baf99fb0f7b:00209] btl_uct_memory_domains=mlx5_0,mlx4_0 (default)
[1baf99fb0f7b:00209] btl_uct_transports=dc_mlx5,rc_mlx5,ud,ugni_rdma,ugni_smsg,any (default)
[1baf99fb0f7b:00209] btl_uct_num_contexts_per_module=5 (default)
[1baf99fb0f7b:00209] btl_uct_disable_ucx_memory_hooks=true (default)
[1baf99fb0f7b:00209] btl_uct_bind_threads_to_contexts=true (default)
[1baf99fb0f7b:00209] btl_uct_exclusivity=65536 (default)
[1baf99fb0f7b:00209] btl_uct_flags=put,get,atomics,fetching-atomics (default)
[1baf99fb0f7b:00209] btl_uct_rndv_eager_limit=8192 (default)
[1baf99fb0f7b:00209] btl_uct_eager_limit=8192 (default)
[1baf99fb0f7b:00209] btl_uct_get_limit=8388608 (default)
[1baf99fb0f7b:00209] btl_uct_get_alignment=0 (default)
[1baf99fb0f7b:00209] btl_uct_put_limit=8388608 (default)
[1baf99fb0f7b:00209] btl_uct_put_alignment=0 (default)
[1baf99fb0f7b:00209] btl_uct_accelerator_eager_limit=0 (default)
[1baf99fb0f7b:00209] btl_uct_accelerator_rdma_limit=18446744073709551615 (default)
[1baf99fb0f7b:00209] btl_uct_accelerator_max_send_size=0 (default)
[1baf99fb0f7b:00209] btl_uct_max_send_size=65536 (default)
[1baf99fb0f7b:00209] btl_uct_rdma_pipeline_send_length=8192 (default)
[1baf99fb0f7b:00209] btl_uct_rdma_pipeline_frag_size=4194304 (default)
[1baf99fb0f7b:00209] btl_uct_min_rdma_pipeline_size=16384 (default)
[1baf99fb0f7b:00209] btl_uct_latency=0 (default)
[1baf99fb0f7b:00209] btl_uct_bandwidth=0 (default)
[1baf99fb0f7b:00209] btl_tcp_links=1 (default)
[1baf99fb0f7b:00209] btl_tcp_if_include= (default)
[1baf99fb0f7b:00209] btl_tcp_if_exclude=lo,sppp (default)
[1baf99fb0f7b:00209] btl_tcp_free_list_num=8 (default)
[1baf99fb0f7b:00209] btl_tcp_free_list_max=-1 (default)
[1baf99fb0f7b:00209] btl_tcp_free_list_inc=32 (default)
[1baf99fb0f7b:00209] btl_tcp_sndbuf=0 (default)
[1baf99fb0f7b:00209] btl_tcp_rcvbuf=0 (default)
[1baf99fb0f7b:00209] btl_tcp_endpoint_cache=30720 (default)
[1baf99fb0f7b:00209] btl_tcp_use_nagle=0 (default)
[1baf99fb0f7b:00209] btl_tcp_port_min_v4=1024 (default)
[1baf99fb0f7b:00209] btl_tcp_port_range_v4=64511 (default)
[1baf99fb0f7b:00209] btl_tcp_port_min_v6=1024 (default)
[1baf99fb0f7b:00209] btl_tcp_port_range_v6=64511 (default)
[1baf99fb0f7b:00209] btl_tcp_progress_thread=0 (default)
[1baf99fb0f7b:00209] btl_tcp_warn_all_unfound_interfaces=false (default)
[1baf99fb0f7b:00209] btl_tcp_exclusivity=100 (default)
[1baf99fb0f7b:00209] btl_tcp_flags=send,put,inplace,need-ack,need-csum,hetero-rdma (default)
[1baf99fb0f7b:00209] btl_tcp_atomic_flags= (default)
[1baf99fb0f7b:00209] btl_tcp_rndv_eager_limit=65536 (default)
[1baf99fb0f7b:00209] btl_tcp_eager_limit=65536 (default)
[1baf99fb0f7b:00209] btl_tcp_put_limit=18446744073709551615 (default)
[1baf99fb0f7b:00209] btl_tcp_put_alignment=0 (default)
[1baf99fb0f7b:00209] btl_tcp_accelerator_eager_limit=0 (default)
[1baf99fb0f7b:00209] btl_tcp_accelerator_rdma_limit=18446744073709551615 (default)
[1baf99fb0f7b:00209] btl_tcp_accelerator_max_send_size=0 (default)
[1baf99fb0f7b:00209] btl_tcp_max_send_size=131072 (default)
[1baf99fb0f7b:00209] btl_tcp_rdma_pipeline_send_length=131072 (default)
[1baf99fb0f7b:00209] btl_tcp_rdma_pipeline_frag_size=2147482624 (default)
[1baf99fb0f7b:00209] btl_tcp_min_rdma_pipeline_size=196608 (default)
[1baf99fb0f7b:00209] btl_tcp_latency=0 (default)
[1baf99fb0f7b:00209] btl_tcp_bandwidth=0 (default)
[1baf99fb0f7b:00209] btl_tcp_disable_family=0 (default)
[1baf99fb0f7b:00209] pml= (default)
[1baf99fb0f7b:00209] pml_base_verbose=error (default)
[1baf99fb0f7b:00209] pml_base_bsend_allocator=basic (default)
[1baf99fb0f7b:00209] pml_base_wrapper= (default)
[1baf99fb0f7b:00209] pml_wrapper= (default)
[1baf99fb0f7b:00209] pml_base_check_pml=true (default)
[1baf99fb0f7b:00209] opal_common_ucx_verbose=0 (default)
[1baf99fb0f7b:00209] opal_common_ucx_progress_iterations=100 (default)
[1baf99fb0f7b:00209] opal_common_ucx_opal_mem_hooks=true (default)
[1baf99fb0f7b:00209] opal_common_ucx_tls=rc_verbs,ud_verbs,rc_mlx5,dc_mlx5,ud_mlx5,cuda_ipc,rocm_ipc (default)
[1baf99fb0f7b:00209] opal_common_ucx_devices=mlx* (default)
[1baf99fb0f7b:00209] pml_ob1_verbose=0 (default)
[1baf99fb0f7b:00209] pml_ob1_free_list_num=4 (default)
[1baf99fb0f7b:00209] pml_ob1_free_list_max=-1 (default)
[1baf99fb0f7b:00209] pml_ob1_free_list_inc=64 (default)
[1baf99fb0f7b:00209] pml_ob1_priority=20 (default)
[1baf99fb0f7b:00209] pml_ob1_send_pipeline_depth=3 (default)
[1baf99fb0f7b:00209] pml_ob1_recv_pipeline_depth=4 (default)
[1baf99fb0f7b:00209] pml_ob1_max_rdma_per_request=4 (default)
[1baf99fb0f7b:00209] pml_ob1_max_send_per_range=4 (default)
[1baf99fb0f7b:00209] pml_ob1_unexpected_limit=128 (default)
[1baf99fb0f7b:00209] pml_ob1_use_all_rdma=false (default)
[1baf99fb0f7b:00209] pml_ob1_allocator=bucket (default)
[1baf99fb0f7b:00209] pml_ob1_accelerator_events_max=400 (default)
[1baf99fb0f7b:00209] memory= (default)
[1baf99fb0f7b:00209] memory_base_verbose=error (default)
[1baf99fb0f7b:00209] memory_patcher_priority=80 (default)
[1baf99fb0f7b:00209] patcher= (default)
[1baf99fb0f7b:00209] patcher_base_verbose=error (default)
[1baf99fb0f7b:00209] patcher_overwrite_priority=37 (default)
[1baf99fb0f7b:00209] coll= (default)
[1baf99fb0f7b:00209] coll_base_verbose=error (default)
[1baf99fb0f7b:00209] coll_han_priority=35 (default)
[1baf99fb0f7b:00209] coll_han_verbose=0 (default)
[1baf99fb0f7b:00209] coll_han_bcast_segsize=65536 (default)
[1baf99fb0f7b:00209] coll_han_bcast_up_module=self (default)
[1baf99fb0f7b:00209] coll_han_bcast_low_module=self (default)
[1baf99fb0f7b:00209] coll_han_reduce_segsize=65536 (default)
[1baf99fb0f7b:00209] coll_han_reduce_up_module=self (default)
[1baf99fb0f7b:00209] coll_han_reduce_low_module=self (default)
[1baf99fb0f7b:00209] coll_han_allreduce_segsize=65536 (default)
[1baf99fb0f7b:00209] coll_han_allreduce_up_module=self (default)
[1baf99fb0f7b:00209] coll_han_allreduce_low_module=self (default)
[1baf99fb0f7b:00209] coll_han_allgather_up_module=self (default)
[1baf99fb0f7b:00209] coll_han_allgather_low_module=self (default)
[1baf99fb0f7b:00209] coll_han_gather_up_module=self (default)
[1baf99fb0f7b:00209] coll_han_gather_low_module=self (default)
[1baf99fb0f7b:00209] coll_han_scatter_up_module=self (default)
[1baf99fb0f7b:00209] coll_han_scatter_low_module=self (default)
[1baf99fb0f7b:00209] coll_han_reproducible=false (default)
[1baf99fb0f7b:00209] coll_han_use_allgather_algorithm=default (default)
[1baf99fb0f7b:00209] coll_han_use_allreduce_algorithm=default (default)
[1baf99fb0f7b:00209] coll_han_use_barrier_algorithm=default (default)
[1baf99fb0f7b:00209] coll_han_use_bcast_algorithm=default (default)
[1baf99fb0f7b:00209] coll_han_use_gather_algorithm=default (default)
[1baf99fb0f7b:00209] coll_han_use_reduce_algorithm=default (default)
[1baf99fb0f7b:00209] coll_han_use_scatter_algorithm=default (default)
[1baf99fb0f7b:00209] coll_han_use_simple_allgather=false (default)
[1baf99fb0f7b:00209] coll_han_use_simple_allreduce=false (default)
[1baf99fb0f7b:00209] coll_han_use_simple_bcast=false (default)
[1baf99fb0f7b:00209] coll_han_use_simple_gather=true (default)
[1baf99fb0f7b:00209] coll_han_use_simple_reduce=false (default)
[1baf99fb0f7b:00209] coll_han_use_simple_scatter=false (default)
[1baf99fb0f7b:00209] coll_han_allgather_dynamic_intra_node_module=3 (default)
[1baf99fb0f7b:00209] coll_han_allgather_dynamic_inter_node_module=3 (default)
[1baf99fb0f7b:00209] coll_han_allgather_dynamic_global_communicator_module=6 (default)
[1baf99fb0f7b:00209] coll_han_allgatherv_dynamic_intra_node_module=3 (default)
[1baf99fb0f7b:00209] coll_han_allgatherv_dynamic_inter_node_module=3 (default)
[1baf99fb0f7b:00209] coll_han_allgatherv_dynamic_global_communicator_module=6 (default)
[1baf99fb0f7b:00209] coll_han_allreduce_dynamic_intra_node_module=3 (default)
[1baf99fb0f7b:00209] coll_han_allreduce_dynamic_inter_node_module=3 (default)
[1baf99fb0f7b:00209] coll_han_allreduce_dynamic_global_communicator_module=6 (default)
[1baf99fb0f7b:00209] coll_han_barrier_dynamic_intra_node_module=3 (default)
[1baf99fb0f7b:00209] coll_han_barrier_dynamic_inter_node_module=3 (default)
[1baf99fb0f7b:00209] coll_han_barrier_dynamic_global_communicator_module=6 (default)
[1baf99fb0f7b:00209] coll_han_bcast_dynamic_intra_node_module=3 (default)
[1baf99fb0f7b:00209] coll_han_bcast_dynamic_inter_node_module=3 (default)
[1baf99fb0f7b:00209] coll_han_bcast_dynamic_global_communicator_module=6 (default)
[1baf99fb0f7b:00209] coll_han_gather_dynamic_intra_node_module=3 (default)
[1baf99fb0f7b:00209] coll_han_gather_dynamic_inter_node_module=3 (default)
[1baf99fb0f7b:00209] coll_han_gather_dynamic_global_communicator_module=6 (default)
[1baf99fb0f7b:00209] coll_han_reduce_dynamic_intra_node_module=3 (default)
[1baf99fb0f7b:00209] coll_han_reduce_dynamic_inter_node_module=3 (default)
[1baf99fb0f7b:00209] coll_han_reduce_dynamic_global_communicator_module=6 (default)
[1baf99fb0f7b:00209] coll_han_scatter_dynamic_intra_node_module=3 (default)
[1baf99fb0f7b:00209] coll_han_scatter_dynamic_inter_node_module=3 (default)
[1baf99fb0f7b:00209] coll_han_scatter_dynamic_global_communicator_module=6 (default)
[1baf99fb0f7b:00209] coll_han_use_dynamic_file_rules=false (default)
[1baf99fb0f7b:00209] coll_han_dynamic_rules_filename= (default)
[1baf99fb0f7b:00209] coll_han_dump_dynamic_rules=false (default)
[1baf99fb0f7b:00209] coll_han_max_dynamic_errors=10 (default)
[1baf99fb0f7b:00209] coll_tuned_priority=30 (default)
[1baf99fb0f7b:00209] coll_tuned_init_tree_fanout=4 (default)
[1baf99fb0f7b:00209] coll_tuned_init_chain_fanout=4 (default)
[1baf99fb0f7b:00209] coll_tuned_alltoall_small_msg=200 (default)
[1baf99fb0f7b:00209] coll_tuned_alltoall_intermediate_msg=3000 (default)
[1baf99fb0f7b:00209] coll_tuned_use_dynamic_rules=false (default)
[1baf99fb0f7b:00209] coll_tuned_dynamic_rules_filename= (default)
[1baf99fb0f7b:00209] coll_tuned_allreduce_algorithm_count=7 (default)
[1baf99fb0f7b:00209] coll_tuned_allreduce_algorithm=ignore (default)
[1baf99fb0f7b:00209] coll_tuned_allreduce_algorithm_segmentsize=0 (default)
[1baf99fb0f7b:00209] coll_tuned_allreduce_algorithm_tree_fanout=4 (default)
[1baf99fb0f7b:00209] coll_tuned_allreduce_algorithm_chain_fanout=4 (default)
[1baf99fb0f7b:00209] coll_tuned_alltoall_algorithm_count=6 (default)
[1baf99fb0f7b:00209] coll_tuned_alltoall_algorithm=ignore (default)
[1baf99fb0f7b:00209] coll_tuned_alltoall_algorithm_segmentsize=0 (default)
[1baf99fb0f7b:00209] coll_tuned_alltoall_algorithm_tree_fanout=4 (default)
[1baf99fb0f7b:00209] coll_tuned_alltoall_algorithm_chain_fanout=4 (default)
[1baf99fb0f7b:00209] coll_tuned_alltoall_large_msg=3000 (default)
[1baf99fb0f7b:00209] coll_tuned_alltoall_min_procs=0 (default)
[1baf99fb0f7b:00209] coll_tuned_alltoall_algorithm_max_requests=0 (default)
[1baf99fb0f7b:00209] coll_tuned_allgather_algorithm_count=8 (default)
[1baf99fb0f7b:00209] coll_tuned_allgather_algorithm=ignore (default)
[1baf99fb0f7b:00209] coll_tuned_allgather_algorithm_segmentsize=0 (default)
[1baf99fb0f7b:00209] coll_tuned_allgather_algorithm_tree_fanout=4 (default)
[1baf99fb0f7b:00209] coll_tuned_allgather_algorithm_chain_fanout=4 (default)
[1baf99fb0f7b:00209] coll_tuned_allgatherv_algorithm_count=7 (default)
[1baf99fb0f7b:00209] coll_tuned_allgatherv_algorithm=ignore (default)
[1baf99fb0f7b:00209] coll_tuned_allgatherv_algorithm_segmentsize=0 (default)
[1baf99fb0f7b:00209] coll_tuned_allgatherv_algorithm_tree_fanout=4 (default)
[1baf99fb0f7b:00209] coll_tuned_allgatherv_algorithm_chain_fanout=4 (default)
[1baf99fb0f7b:00209] coll_tuned_alltoallv_algorithm_count=3 (default)
[1baf99fb0f7b:00209] coll_tuned_alltoallv_algorithm=ignore (default)
[1baf99fb0f7b:00209] coll_tuned_barrier_algorithm_count=7 (default)
[1baf99fb0f7b:00209] coll_tuned_barrier_algorithm=ignore (default)
[1baf99fb0f7b:00209] coll_tuned_bcast_algorithm_count=10 (default)
[1baf99fb0f7b:00209] coll_tuned_bcast_algorithm=ignore (default)
[1baf99fb0f7b:00209] coll_tuned_bcast_algorithm_segmentsize=0 (default)
[1baf99fb0f7b:00209] coll_tuned_bcast_algorithm_tree_fanout=4 (default)
[1baf99fb0f7b:00209] coll_tuned_bcast_algorithm_chain_fanout=4 (default)
[1baf99fb0f7b:00209] coll_tuned_bcast_algorithm_knomial_radix=4 (default)
[1baf99fb0f7b:00209] coll_tuned_reduce_algorithm_count=8 (default)
[1baf99fb0f7b:00209] coll_tuned_reduce_algorithm=ignore (default)
[1baf99fb0f7b:00209] coll_tuned_reduce_algorithm_segmentsize=0 (default)
[1baf99fb0f7b:00209] coll_tuned_reduce_algorithm_tree_fanout=4 (default)
[1baf99fb0f7b:00209] coll_tuned_reduce_algorithm_chain_fanout=4 (default)
[1baf99fb0f7b:00209] coll_tuned_reduce_algorithm_max_requests=0 (default)
[1baf99fb0f7b:00209] coll_tuned_reduce_scatter_algorithm_count=5 (default)
[1baf99fb0f7b:00209] coll_tuned_reduce_scatter_algorithm=ignore (default)
[1baf99fb0f7b:00209] coll_tuned_reduce_scatter_algorithm_segmentsize=0 (default)
[1baf99fb0f7b:00209] coll_tuned_reduce_scatter_algorithm_tree_fanout=4 (default)
[1baf99fb0f7b:00209] coll_tuned_reduce_scatter_algorithm_chain_fanout=4 (default)
[1baf99fb0f7b:00209] coll_tuned_reduce_scatter_block_algorithm_count=5 (default)
[1baf99fb0f7b:00209] coll_tuned_reduce_scatter_block_algorithm=ignore (default)
[1baf99fb0f7b:00209] coll_tuned_reduce_scatter_block_algorithm_segmentsize=0 (default)
[1baf99fb0f7b:00209] coll_tuned_reduce_scatter_block_algorithm_tree_fanout=4 (default)
[1baf99fb0f7b:00209] coll_tuned_gather_algorithm_count=4 (default)
[1baf99fb0f7b:00209] coll_tuned_gather_algorithm=ignore (default)
[1baf99fb0f7b:00209] coll_tuned_gather_algorithm_segmentsize=0 (default)
[1baf99fb0f7b:00209] coll_tuned_gather_algorithm_tree_fanout=4 (default)
[1baf99fb0f7b:00209] coll_tuned_gather_algorithm_chain_fanout=4 (default)
[1baf99fb0f7b:00209] coll_tuned_scatter_algorithm_count=4 (default)
[1baf99fb0f7b:00209] coll_tuned_scatter_algorithm=ignore (default)
[1baf99fb0f7b:00209] coll_tuned_scatter_algorithm_segmentsize=0 (default)
[1baf99fb0f7b:00209] coll_tuned_scatter_algorithm_tree_fanout=4 (default)
[1baf99fb0f7b:00209] coll_tuned_scatter_algorithm_chain_fanout=4 (default)
[1baf99fb0f7b:00209] coll_tuned_scatter_min_procs=0 (default)
[1baf99fb0f7b:00209] coll_tuned_scatter_algorithm_max_requests=0 (default)
[1baf99fb0f7b:00209] coll_tuned_scatter_intermediate_msg=0 (default)
[1baf99fb0f7b:00209] coll_tuned_scatter_large_msg=0 (default)
[1baf99fb0f7b:00209] coll_tuned_exscan_algorithm_count=3 (default)
[1baf99fb0f7b:00209] coll_tuned_exscan_algorithm=ignore (default)
[1baf99fb0f7b:00209] coll_tuned_scan_algorithm_count=3 (default)
[1baf99fb0f7b:00209] coll_tuned_scan_algorithm=ignore (default)
[1baf99fb0f7b:00209] coll_sync_priority=50 (default)
[1baf99fb0f7b:00209] coll_sync_barrier_before=0 (default)
[1baf99fb0f7b:00209] coll_sync_barrier_after=0 (default)
[1baf99fb0f7b:00209] coll_sm_priority=0 (default)
[1baf99fb0f7b:00209] coll_sm_control_size=4096 (default)
[1baf99fb0f7b:00209] coll_sm_fragment_size=8192 (default)
[1baf99fb0f7b:00209] coll_sm_comm_in_use_flags=2 (default)
[1baf99fb0f7b:00209] coll_sm_comm_num_segments=8 (default)
[1baf99fb0f7b:00209] coll_sm_tree_degree=4 (default)
[1baf99fb0f7b:00209] coll_sm_info_num_procs=4 (default)
[1baf99fb0f7b:00209] coll_sm_shared_mem_used_data=548864 (default)
[1baf99fb0f7b:00209] coll_cuda_priority=78 (default)
[1baf99fb0f7b:00209] coll_cuda_disable_cuda_coll=0 (default)
[1baf99fb0f7b:00209] coll_adapt_priority=0 (default)
[1baf99fb0f7b:00209] coll_adapt_verbose=0 (default)
[1baf99fb0f7b:00209] coll_adapt_context_free_list_min=64 (default)
[1baf99fb0f7b:00209] coll_adapt_context_free_list_max=1024 (default)
[1baf99fb0f7b:00209] coll_adapt_context_free_list_inc=32 (default)
[1baf99fb0f7b:00209] coll_adapt_bcast_algorithm=1 (default)
[1baf99fb0f7b:00209] coll_adapt_bcast_segment_size=0 (default)
[1baf99fb0f7b:00209] coll_adapt_bcast_max_send_requests=2 (default)
[1baf99fb0f7b:00209] coll_adapt_bcast_max_recv_requests=3 (default)
[1baf99fb0f7b:00209] coll_adapt_bcast_synchronous_send=true (default)
[1baf99fb0f7b:00209] coll_adapt_reduce_algorithm=1 (default)
[1baf99fb0f7b:00209] coll_adapt_reduce_segment_size=163740 (default)
[1baf99fb0f7b:00209] coll_adapt_reduce_max_send_requests=2 (default)
[1baf99fb0f7b:00209] coll_adapt_reduce_max_recv_requests=3 (default)
[1baf99fb0f7b:00209] coll_adapt_inbuf_free_list_min=10 (default)
[1baf99fb0f7b:00209] coll_adapt_inbuf_free_list_max=10000 (default)
[1baf99fb0f7b:00209] coll_adapt_inbuf_free_list_inc=10 (default)
[1baf99fb0f7b:00209] coll_adapt_reduce_synchronous_send=true (default)
[1baf99fb0f7b:00209] coll_libnbc_priority=10 (default)
[1baf99fb0f7b:00209] coll_libnbc_ibcast_skip_dt_decision=true (default)
[1baf99fb0f7b:00209] coll_libnbc_iallgather_algorithm=ignore (default)
[1baf99fb0f7b:00209] coll_libnbc_iallreduce_algorithm=ignore (default)
[1baf99fb0f7b:00209] coll_libnbc_ibcast_algorithm=ignore (default)
[1baf99fb0f7b:00209] coll_libnbc_ibcast_knomial_radix=4 (default)
[1baf99fb0f7b:00209] coll_libnbc_iexscan_algorithm=ignore (default)
[1baf99fb0f7b:00209] coll_libnbc_ireduce_algorithm=ignore (default)
[1baf99fb0f7b:00209] coll_libnbc_iscan_algorithm=ignore (default)
[1baf99fb0f7b:00209] coll_self_priority=75 (default)
[1baf99fb0f7b:00209] coll_inter_priority=40 (default)
[1baf99fb0f7b:00209] coll_inter_verbose=0 (default)
[1baf99fb0f7b:00209] coll_ftagree_priority=30 (default)
[1baf99fb0f7b:00209] coll_ftagree_agreement=65535 (default)
[1baf99fb0f7b:00209] coll_ftagree_era_topology=1 (default)
[1baf99fb0f7b:00209] coll_ftagree_era_rebuild=0 (default)
[1baf99fb0f7b:00209] coll_basic_priority=10 (default)
[1baf99fb0f7b:00209] coll_basic_crossover=4 (default)
[1baf99fb0f7b:00209] coll_ucc_priority=10 (default)
[1baf99fb0f7b:00209] coll_ucc_verbose=0 (default)
[1baf99fb0f7b:00209] coll_ucc_enable=0 (default)
[1baf99fb0f7b:00209] coll_ucc_np=2 (default)
[1baf99fb0f7b:00209] coll_ucc_print_compiletime_version=1.3.0 (default)
[1baf99fb0f7b:00209] coll_ucc_print_runtime_version= (default)
[1baf99fb0f7b:00209] coll_ucc_cls= (default)
[1baf99fb0f7b:00209] coll_ucc_cts=barrier,bcast,allreduce,alltoall,alltoallv,allgather,allgatherv,reduce,gather,gatherv,reduce_scatter_block,reduce_scatter,scatterv,scatter,ibarrier,ibcast,iallreduce,ialltoall,ialltoallv,iallgather,iallgatherv,ireduce,igather,igatherv,ireduce_scatter_block,ireduce_scatter,iscatterv,iscatter (default)
[1baf99fb0f7b:00209] osc= (default)
[1baf99fb0f7b:00209] osc_base_verbose=error (default)
[1baf99fb0f7b:00209] osc_rdma_no_locks=false (default)
[1baf99fb0f7b:00209] osc_rdma_acc_single_intrinsic=false (default)
[1baf99fb0f7b:00209] osc_rdma_acc_use_amo=true (default)
[1baf99fb0f7b:00209] osc_rdma_buffer_size=32768 (default)
[1baf99fb0f7b:00209] osc_rdma_max_attach=64 (default)
[1baf99fb0f7b:00209] osc_rdma_priority=20 (default)
[1baf99fb0f7b:00209] osc_rdma_locking_mode=two_level (default)
[1baf99fb0f7b:00209] osc_rdma_btls=ugni,uct,ofi (default)
[1baf99fb0f7b:00209] osc_rdma_backing_directory=/dev/shm (default)
[1baf99fb0f7b:00209] osc_rdma_network_max_amo=32 (default)
[1baf99fb0f7b:00209] osc_rdma_minimum_memory_alignment=4096 (default)
[1baf99fb0f7b:00209] osc_sm_backing_directory=/dev/shm (default)
[1baf99fb0f7b:00209] osc_sm_priority=100 (default)
[1baf99fb0f7b:00209] osc_ucx_priority=60 (default)
[1baf99fb0f7b:00209] osc_ucx_no_locks=false (default)
[1baf99fb0f7b:00209] osc_ucx_acc_single_intrinsic=false (default)
[1baf99fb0f7b:00209] osc_ucx_enable_nonblocking_accumulate=false (default)
[1baf99fb0f7b:00209] osc_ucx_enable_wpool_thread_multiple=true (default)
[1baf99fb0f7b:00209] osc_ucx_outstanding_ops_flush_threshold=64 (default)
[1baf99fb0f7b:00209] osc_ucx_verbose=0 (default)
[1baf99fb0f7b:00209] osc_ucx_progress_iterations=100 (default)
[1baf99fb0f7b:00209] osc_ucx_opal_mem_hooks=true (default)
[1baf99fb0f7b:00209] osc_ucx_tls=rc_verbs,ud_verbs,rc_mlx5,dc_mlx5,ud_mlx5,cuda_ipc,rocm_ipc (default)
[1baf99fb0f7b:00209] osc_ucx_devices=mlx* (default)
[1baf99fb0f7b:00209] osc_ucx_backing_directory=/dev/shm (default)
[1baf99fb0f7b:00209] btl_tcp_bandwidth_eth0=10000 (default)
[1baf99fb0f7b:00209] btl_tcp_latency_eth0=100 (default)
[1baf99fb0f7b:00209] btl_tcp_bandwidth_eth0:0=10000 (default)
[1baf99fb0f7b:00209] btl_tcp_latency_eth0:0=100 (default)
[1baf99fb0f7b:00209] part= (default)
[1baf99fb0f7b:00209] part_base_verbose=error (default)
[1baf99fb0f7b:00209] part_persist_free_list_num=4 (default)
[1baf99fb0f7b:00209] part_persist_free_list_max=-1 (default)
[1baf99fb0f7b:00209] part_persist_free_list_inc=64 (default)

So the defaults are still:

  • pml=ob1 (pml=ucx does not appear to be available? Maybe a bug here?)
  • osc=sm
  • coll_ucc_enable=0

meaning (if I've understood correctly) that none of the ucc/ucx lines has any effect on the defaults once the components can be loaded at all.
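For readers comparing the priority values in the dump above, here is a minimal sketch of a shell helper that ranks the components of one framework by their `*_priority` value. The function name `rank_priorities` is hypothetical, and the sketch assumes the `[host:pid] name=value (source)` line format shown above. A static ranking like this only reflects the listed priority values; by itself it does not show which component is selected at run time.

```shell
# Rank the components of one MCA framework by their *_priority value.
# Reads a parameter dump on stdin; $1 is the framework name (e.g. osc, coll).
# Hypothetical helper: assumes lines look like
#   [host:pid] osc_sm_priority=100 (default)
rank_priorities() {
    framework=$1
    # Keep only "<framework>_<component>_priority=<n>" lines and print
    # "<n> <component>", then sort numerically, highest priority first.
    sed -n "s/^\[[^]]*] ${framework}_\([a-z0-9]*\)_priority=\([0-9-]*\) .*/\2 \1/p" |
        sort -k1,1nr
}
```

With the dump saved to a file (the name `mca-params.txt` is an assumption), `rank_priorities osc < mca-params.txt` would list `sm` (100), then `ucx` (60), then `rdma` (20) for the three `osc` priorities visible in the dump above.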

@dalcinl
Contributor Author

dalcinl commented Nov 6, 2024

@minrk I definitely second your comments. As I clarified in my previous comment, this PR so far deals only with the build part; all the other stuff is still TODO.

Regarding whether we should wait for a version bump, maybe not. Sometimes I worry too much.
If we go with this change, and anyone complains about the aftermath, they are more than welcome to join us as recipe maintainers and offer a solution.

So the defaults are still:

Again, this is because this PR has only implemented what's necessary for building. IIRC, UCX is still disabled by default.

@minrk
Member

minrk commented Nov 6, 2024

this is because this PR has only implemented what's necessary for building.

To be clear, I did not run the test with this PR. I used the latest published build, in an env where I had removed the MCA conf file that disables these:

mamba install openmpi ucc ucx
rm $PREFIX/etc/openmpi-mca-params.conf

So what I pasted are the defaults after removing the parameters we are setting, meaning that the disables we are adding are already having no effect on the defaults once ucc/ucx are available. So even removing those lines won't actually change the defaults. I think that means we can safely remove both the param overrides and the messages for ucc/ucx in this PR, since the defaults will not be changed by their removal.
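As a less destructive alternative to deleting the conf file, the overrides it currently sets can be listed first. This is a minimal sketch; the function name `active_mca_overrides` is hypothetical, and it assumes the file uses one `name = value` assignment per line with `#` starting a comment.

```shell
# List the parameter assignments that are active in an MCA params file,
# skipping blank lines and "#" comment lines.
# Hypothetical helper: $1 is the path to the conf file.
active_mca_overrides() {
    grep -Ev '^[[:space:]]*(#|$)' "$1"
}
```

For the env discussed here, this would be called as `active_mca_overrides "$PREFIX/etc/openmpi-mca-params.conf"` before the file is removed.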

@dalcinl
Contributor Author

dalcinl commented Nov 6, 2024

So even removing those lines won't actually change the defaults

Please read the following excerpt from Open MPI's documentation:

While users can manually select any of the above transports at run time, Open MPI will select a default transport as follows:

  1. If InfiniBand devices are available, use the UCX PML.
  2. If PSM, PSM2, or other tag-matching-supporting Libfabric transport devices are available (e.g., Cray uGNI), use the cm PML and a single appropriate corresponding mtl module.
  3. Otherwise, use the ob1 PML and one or more appropriate btl modules.

So my guess is you performed your testing on a machine without any special network hardware...
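The selection order quoted from the documentation can be sketched as a small decision function. This is only an illustration of the three rules, not Open MPI's actual implementation: the function name `default_pml` and its two yes/no arguments are hypothetical, and in practice Open MPI detects the hardware itself at run time.

```shell
# Sketch of the documented default PML choice.
# $1: "yes" if InfiniBand devices are available (hypothetical flag)
# $2: "yes" if PSM, PSM2, or another tag-matching Libfabric
#     transport device is available (hypothetical flag)
default_pml() {
    has_infiniband=$1
    has_tag_matching=$2
    if [ "$has_infiniband" = yes ]; then
        echo ucx    # rule 1: InfiniBand present
    elif [ "$has_tag_matching" = yes ]; then
        echo cm     # rule 2: tag-matching transport present
    else
        echo ob1    # rule 3: fallback
    fi
}
```

On a machine without any special network hardware, `default_pml no no` prints `ob1`, which matches the default observed in the test above.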

@minrk
Member

minrk commented Nov 6, 2024

Got it, thank you. I assumed the priority values were meaningful or fixed by the build rather than the runtime environment, but apparently not.

I still support the conda packages having openmpi's default behavior rather than being unique now that there's no longer a reason to exclude ucx, but I am also okay if folks prefer waiting for a version to make that transition.

@dalcinl
Contributor Author

dalcinl commented Nov 6, 2024

I still support the conda packages having openmpi's default behavior

If no one objects in the next couple of days, let's go for it. However, it would be nice to have a libfabric package up and running to use it as a dependency.

@minrk
Member

minrk commented Nov 6, 2024

Yes, and I think we will before too long (conda-forge/staged-recipes#27988). Probably best to tackle one thing at a time, though.
