
Unexpected error: perf_analyzer did not produce any output. It was likely terminated with a SIGABRT #750

Open
wxthu opened this issue Aug 23, 2023 · 12 comments
Labels
bug Something isn't working


@wxthu

wxthu commented Aug 23, 2023

When I use Model Analyzer to profile multiple models composed of resnext models from torchvision with the PyTorch backend, I always get the following output and nothing in the profile_results directory:

[Model Analyzer] perf_analyzer took very long to exit, killing perf_analyzer
[Model Analyzer] perf_analyzer did not produce any output.
[Model Analyzer] No changes made to analyzer data, no checkpoint saved.
Traceback (most recent call last):
  File "/usr/local/bin/model-analyzer", line 8, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.10/dist-packages/model_analyzer/entrypoint.py", line 267, in main
    analyzer.profile(client=client,
  File "/usr/local/lib/python3.10/dist-packages/model_analyzer/analyzer.py", line 116, in profile
    self._profile_models()
  File "/usr/local/lib/python3.10/dist-packages/model_analyzer/analyzer.py", line 220, in _profile_models
    self._model_manager.run_models(models=models)
  File "/usr/local/lib/python3.10/dist-packages/model_analyzer/model_manager.py", line 130, in run_models
    self._stop_ma_if_no_valid_measurement_threshold_reached()
  File "/usr/local/lib/python3.10/dist-packages/model_analyzer/model_manager.py", line 218, in _stop_ma_if_no_valid_measurement_threshold_reached
    raise TritonModelAnalyzerException(
model_analyzer.model_analyzer_exceptions.TritonModelAnalyzerException: The first 2 attempts to acquire measurements have failed. Please examine the Tritonserver/PA error logs to determine what has gone wrong.

When I check the PA logs generated by MA, I find:

Command:
mpiexec --allow-run-as-root --tag-output -n 1 perf_analyzer --enable-mpi -m resnext_torch_config_default -b 1 -u localhost:8001 -i grpc -f resnext_torch_config_default-results.csv --verbose-csv --concurrency-range 1 --measurement-mode count_windows --collect-metrics --metrics-url http://localhost:8002/metrics --metrics-interval 1000.0 : -n 1 perf_analyzer --enable-mpi -m resnext_torch2_config_default -b 1 -u localhost:8001 -i grpc -f resnext_torch2_config_default-results.csv --verbose-csv --concurrency-range 1 --measurement-mode count_windows --collect-metrics --metrics-url http://localhost:8002/metrics --metrics-interval 1000.0

Error: perf_analyzer did not produce any output. It was likely terminated with a SIGABRT

I do not know how to solve this.

Debugging further, when I executed the command from the PA logs above, I got the following error:

[1,1]<stdout>:[1692767719.500504] [xiangwang-System-Product-Name:1324 :0]     ucp_context.c:1774 UCX  WARN  UCP version is incompatible, required: 1.15, actual: 1.12 (release 1)
[1,0]<stdout>:[1692767719.533941] [xiangwang-System-Product-Name:1323 :0]     ucp_context.c:1774 UCX  WARN  UCP version is incompatible, required: 1.15, actual: 1.12 (release 1)
[1,0]<stdout>:[1692767719.540637] [xiangwang-System-Product-Name:1323 :0]     ucp_context.c:1774 UCX  WARN  UCP version is incompatible, required: 1.15, actual: 1.12 (release 1)
[1,1]<stdout>:[1692767719.540801] [xiangwang-System-Product-Name:1324 :0]     ucp_context.c:1774 UCX  WARN  UCP version is incompatible, required: 1.15, actual: 1.12 (release 1)
[1,0]<stderr>:error: failed to get model metadata: HTTP client failed: Couldn't connect to server
[1,0]<stderr>:
[1,1]<stderr>:error: failed to get model metadata: HTTP client failed: Couldn't connect to server
[1,1]<stderr>:
--------------------------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpiexec detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:

  Process name: [[53772,1],0]
  Exit code:    99
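The "Couldn't connect to server" errors above suggest perf_analyzer could not reach Triton at all when fetching model metadata. One way to narrow this down, assuming Triton's default ports (8000 HTTP, 8001 gRPC, 8002 metrics), is to check that the server is reachable and the model is loaded before launching perf_analyzer:

```shell
# Sanity checks against Triton's standard HTTP endpoints (assumes default port 8000).
# /v2/health/ready returns 200 only when the server is ready for inference.
curl -sf http://localhost:8000/v2/health/ready && echo "server ready"

# Fetch metadata for one of the profiled models; this fails if it is not loaded.
curl -sf http://localhost:8000/v2/models/resnext_torch_config_default
```

If either check fails, the SIGABRT from perf_analyzer is likely a downstream symptom of the server not being up, or the model configs not being loaded, at the moment MA launches PA.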
@nv-braf
Contributor

nv-braf commented Aug 23, 2023

Can you post the full command you are using to run MA?
Can you post the tritonserver output log?
Have you confirmed that you can load and successfully run the model on a tritonserver?

@wxthu
Author

wxthu commented Aug 24, 2023

My command for running MA is as follows:
model-analyzer profile -f config2.yml --triton-launch-mode=docker --triton-docker-shm-size=4g --output-model-repository-path /home/llc/triton-multi-model-profile/results --export-path profile_results

The output log from the Triton server:

[Model Analyzer] Initializing GPUDevice handles
[Model Analyzer] Using GPU 0 NVIDIA GeForce RTX 4090 with UUID GPU-c81ce6eb-cb1a-d91e-4d7f-0aa4bcdd0ef7
[Model Analyzer] WARNING: Overriding the output model repo path "/home/llc/triton-multi-model-profile/results"
[Model Analyzer] Starting a Triton Server using docker
[Model Analyzer] No checkpoint file found, starting a fresh run.
[Model Analyzer] Profiling server only metrics...
[Model Analyzer] 
[Model Analyzer] Starting quick mode search to find optimal configs
[Model Analyzer] 
[Model Analyzer] 
[Model Analyzer] Creating model config: resnext_torch_config_default
[Model Analyzer] 
[Model Analyzer] 
[Model Analyzer] Creating model config: resnext_torch2_config_default
[Model Analyzer] 
[Model Analyzer] Profiling resnext_torch_config_default: client batch size=1, concurrency=1
[Model Analyzer] Profiling resnext_torch2_config_default: client batch size=1, concurrency=1
[Model Analyzer] 

It then hung for a very long time and finally produced the error above...

@nv-braf
Contributor

nv-braf commented Aug 24, 2023

Can you post the contents of the config2.yml file?
Can you also post/generate the triton server output log (you can do this by using the --triton-output-path option)?
Have you confirmed that you can load and successfully run the model on a tritonserver?

@wxthu
Author

wxthu commented Aug 24, 2023

I have confirmed that I can load and successfully run the model on a tritonserver,
and my config file is as follows:

model_repository: "/home/llc/multi-model-triton/examples/quick-start"
run_config_profile_models_concurrently_enable: true
override_output_model_repository: true
client_protocol: "http"
run_config_search_mode: quick
run_config_search_max_instance_count: 1
run_config_search_max_concurrency: 1
run_config_search_min_model_batch_size: 5
run_config_search_max_model_batch_size: 5
num_configs_per_model: 4

profile_models:
  resnext_torch:
    constraints:
      perf_latency_p99:
        max: 20
    model_config_parameters:
      instance_group:
        - kind: KIND_GPU
          count: 1
  resnext_torch2:
    constraints:
      perf_latency_p99:
        max: 20
    model_config_parameters:
      instance_group:
        - kind: KIND_GPU
          count: 1

and the Triton-server-generated log is in the attachment

@nv-braf
Contributor

nv-braf commented Aug 24, 2023

Thank you. @matthewkotila this looks like an issue in Perf Analyzer. I don't see anything wrong with MA or the tritonserver.

@matthewkotila
Contributor

@wxthu @nv-braf can either of y'all do me a favor and give me minimally reproducible commands that only run PA? This will help my debugging a bunch.

@nv-braf
Contributor

nv-braf commented Aug 24, 2023

Is this what you're looking for?

Command:
mpiexec --allow-run-as-root --tag-output -n 1 perf_analyzer --enable-mpi -m resnext_torch_config_default -b 1 -u localhost:8001 -i grpc -f resnext_torch_config_default-results.csv --verbose-csv --concurrency-range 1 --measurement-mode count_windows --collect-metrics --metrics-url http://localhost:8002/metrics --metrics-interval 1000.0 : -n 1 perf_analyzer --enable-mpi -m resnext_torch2_config_default -b 1 -u localhost:8001 -i grpc -f resnext_torch2_config_default-results.csv --verbose-csv --concurrency-range 1 --measurement-mode count_windows --collect-metrics --metrics-url http://localhost:8002/metrics --metrics-interval 1000.0

Error: perf_analyzer did not produce any output. It was likely terminated with a SIGABRT
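A smaller repro, stripped of mpiexec and the metrics flags, might help isolate whether the multi-process launch is the trigger. This is only a sketch: the flags are taken from the command above, and the model name is the one used in this thread.

```shell
# Single-process perf_analyzer run against one model (no MPI, no metrics collection).
perf_analyzer -m resnext_torch_config_default -b 1 \
  -u localhost:8001 -i grpc \
  --concurrency-range 1 --measurement-mode count_windows
```

If this succeeds while the mpiexec form aborts, the problem is more likely in the MPI/multi-model launch path than in PA's measurement loop.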

@matthewkotila
Contributor

matthewkotila commented Aug 24, 2023

Ooh gotcha, missed that. Thanks!

We're tracking this but we do not currently have an estimate of when we will complete debugging the issue.

@anshudaur

Facing the same issue: Error: perf_analyzer did not produce any output. It was likely terminated with a SIGABRT

@dyastremsky dyastremsky changed the title MA do not work Unexpected error: perf_analyzer did not produce any output. It was likely terminated with a SIGABRT Oct 19, 2023
@dyastremsky dyastremsky added the bug Something isn't working label Oct 19, 2023
@riyajatar37003

Any update on this?
perf_analyzer took very long to exit, killing perf_analyzer
[Model Analyzer] perf_analyzer did not produce any output.

@riyajatar37003

riyajatar37003 commented May 15, 2024

[Model Analyzer] Initializing GPUDevice handles
[Model Analyzer] Using GPU 0 NVIDIA A100-SXM4-40GB with UUID GPU-d9a0447f-f8fa-9d2f-79fc-ecf2567dacc2
[Model Analyzer] WARNING: Overriding the output model repo path "./rerenker_output1"
[Model Analyzer] Starting a local Triton Server
[Model Analyzer] Loaded checkpoint from file /model_repositories/checkpoints/0.ckpt
[Model Analyzer] GPU devices match checkpoint - skipping server metric acquisition
[Model Analyzer]
[Model Analyzer] Starting quick mode search to find optimal configs
[Model Analyzer]
[Model Analyzer] Creating model config: reranker_config_default
[Model Analyzer]
[Model Analyzer] Creating model config: bge_reranker_v2_onnx_config_default
[Model Analyzer]
[Model Analyzer] Profiling reranker_config_default: client batch size=1, concurrency=24
[Model Analyzer] Profiling bge_reranker_v2_onnx_config_default: client batch size=1, concurrency=8
[Model Analyzer]
[Model Analyzer] perf_analyzer took very long to exit, killing perf_analyzer
[Model Analyzer] perf_analyzer did not produce any output.
[Model Analyzer] Saved checkpoint to model_repositories/checkpoints/1.ckpt
[Model Analyzer] Creating model config: reranker_config_0
[Model Analyzer] Setting instance_group to [{'count': 1, 'kind': 'KIND_GPU'}]
[Model Analyzer] Setting max_batch_size to 1
[Model Analyzer] Enabling dynamic_batching
[Model Analyzer]
[Model Analyzer] Creating model config: bge_reranker_v2_onnx_config_0
[Model Analyzer] Setting instance_group to [{'count': 1, 'kind': 'KIND_GPU'}]
[Model Analyzer] Setting max_batch_size to 1
[Model Analyzer] Enabling dynamic_batching
[Model Analyzer]
[Model Analyzer] Profiling reranker_config_0: client batch size=1, concurrency=2
[Model Analyzer] Profiling bge_reranker_v2_onnx_config_0: client batch size=1, concurrency=2
[Model Analyzer]
[Model Analyzer] perf_analyzer took very long to exit, killing perf_analyzer
[Model Analyzer] perf_analyzer did not produce any output.
[Model Analyzer] No changes made to analyzer data, no checkpoint saved.
Traceback (most recent call last):
  File "/opt/app_venv/bin/model-analyzer", line 8, in <module>
    sys.exit(main())
  File "/opt/app_venv/lib/python3.10/site-packages/model_analyzer/entrypoint.py", line 278, in main
    analyzer.profile(
  File "/opt/app_venv/lib/python3.10/site-packages/model_analyzer/analyzer.py", line 124, in profile
    self._profile_models()
  File "/opt/app_venv/lib/python3.10/site-packages/model_analyzer/analyzer.py", line 233, in _profile_models
    self._model_manager.run_models(models=models)
  File "/opt/app_venv/lib/python3.10/site-packages/model_analyzer/model_manager.py", line 145, in run_models
    self._stop_ma_if_no_valid_measurement_threshold_reached()
  File "/opt/app_venv/lib/python3.10/site-packages/model_analyzer/model_manager.py", line 239, in _stop_ma_if_no_valid_measurement_threshold_reached
    raise TritonModelAnalyzerException(
model_analyzer.model_analyzer_exceptions.TritonModelAnalyzerException: The first 2 attempts to acquire measurements have failed. Please examine the Tritonserver/PA error logs to determine what has gone wrong.

@riyajatar37003

riyajatar37003 commented May 15, 2024

The perf_analyzer error log:

Command:
mpiexec --allow-run-as-root --tag-output -n 1 perf_analyzer --enable-mpi -m reranker -b 1 -u localhost:8001 -i grpc -f results.csv --verbose-csv --concurrency-range 24 --measurement-mode count_windows --collect-metrics --metrics-url http://localhost:8002/metrics --metrics-interval 1000 : -n 1 perf_analyzer --enable-mpi -m onnx -b 1 -u localhost:8001 -i grpc -f results.csv --verbose-csv --concurrency-range 8 --measurement-mode count_windows --collect-metrics --metrics-url http://localhost:8002/metrics --metrics-interval 1000

Error: perf_analyzer did not produce any output. It was likely terminated with a SIGABRT.
