
Unexpected error: perf_analyzer did not produce any output. It was likely terminated with a SIGABRT #750

Open
wxthu opened this issue Aug 23, 2023 · 12 comments
Labels
bug Something isn't working


@wxthu

wxthu commented Aug 23, 2023

When I use Model Analyzer to profile multiple models composed of resnext models from torchvision with the PyTorch backend, I always get the following output and nothing in the profile_results directory:

[Model Analyzer] perf_analyzer took very long to exit, killing perf_analyzer
[Model Analyzer] perf_analyzer did not produce any output.
[Model Analyzer] No changes made to analyzer data, no checkpoint saved.
Traceback (most recent call last):
  File "/usr/local/bin/model-analyzer", line 8, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.10/dist-packages/model_analyzer/entrypoint.py", line 267, in main
    analyzer.profile(client=client,
  File "/usr/local/lib/python3.10/dist-packages/model_analyzer/analyzer.py", line 116, in profile
    self._profile_models()
  File "/usr/local/lib/python3.10/dist-packages/model_analyzer/analyzer.py", line 220, in _profile_models
    self._model_manager.run_models(models=models)
  File "/usr/local/lib/python3.10/dist-packages/model_analyzer/model_manager.py", line 130, in run_models
    self._stop_ma_if_no_valid_measurement_threshold_reached()
  File "/usr/local/lib/python3.10/dist-packages/model_analyzer/model_manager.py", line 218, in _stop_ma_if_no_valid_measurement_threshold_reached
    raise TritonModelAnalyzerException(
model_analyzer.model_analyzer_exceptions.TritonModelAnalyzerException: The first 2 attempts to acquire measurements have failed. Please examine the Tritonserver/PA error logs to determine what has gone wrong.

When I check the PA logs generated by MA, I find:

Command:
mpiexec --allow-run-as-root --tag-output -n 1 perf_analyzer --enable-mpi -m resnext_torch_config_default -b 1 -u localhost:8001 -i grpc -f resnext_torch_config_default-results.csv --verbose-csv --concurrency-range 1 --measurement-mode count_windows --collect-metrics --metrics-url http://localhost:8002/metrics --metrics-interval 1000.0 : -n 1 perf_analyzer --enable-mpi -m resnext_torch2_config_default -b 1 -u localhost:8001 -i grpc -f resnext_torch2_config_default-results.csv --verbose-csv --concurrency-range 1 --measurement-mode count_windows --collect-metrics --metrics-url http://localhost:8002/metrics --metrics-interval 1000.0

Error: perf_analyzer did not produce any output. It was likely terminated with a SIGABRT

I do not know how to solve this.

Debugging further, when I executed the command from the PA logs above, I got the following error:

[1,1]<stdout>:[1692767719.500504] [xiangwang-System-Product-Name:1324 :0]     ucp_context.c:1774 UCX  WARN  UCP version is incompatible, required: 1.15, actual: 1.12 (release 1)
[1,0]<stdout>:[1692767719.533941] [xiangwang-System-Product-Name:1323 :0]     ucp_context.c:1774 UCX  WARN  UCP version is incompatible, required: 1.15, actual: 1.12 (release 1)
[1,0]<stdout>:[1692767719.540637] [xiangwang-System-Product-Name:1323 :0]     ucp_context.c:1774 UCX  WARN  UCP version is incompatible, required: 1.15, actual: 1.12 (release 1)
[1,1]<stdout>:[1692767719.540801] [xiangwang-System-Product-Name:1324 :0]     ucp_context.c:1774 UCX  WARN  UCP version is incompatible, required: 1.15, actual: 1.12 (release 1)
[1,0]<stderr>:error: failed to get model metadata: HTTP client failed: Couldn't connect to server
[1,0]<stderr>:
[1,1]<stderr>:error: failed to get model metadata: HTTP client failed: Couldn't connect to server
[1,1]<stderr>:
--------------------------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpiexec detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:

  Process name: [[53772,1],0]
  Exit code:    99
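The "Couldn't connect to server" errors above suggest perf_analyzer could not reach Triton at all when fetching model metadata. One way to narrow this down, assuming Triton's default ports (8000 HTTP, 8001 gRPC, 8002 metrics), is to check that the server is reachable and the model is loaded before launching perf_analyzer:

```shell
# Sanity checks against Triton's standard HTTP endpoints (assumes default port 8000).
# /v2/health/ready returns 200 only when the server is ready for inference.
curl -sf http://localhost:8000/v2/health/ready && echo "server ready"

# Fetch metadata for one of the profiled models; this fails if it is not loaded.
curl -sf http://localhost:8000/v2/models/resnext_torch_config_default
```

If either check fails, the SIGABRT from perf_analyzer is likely a downstream symptom of the server not being up, or the model configs not being loaded, at the moment MA launches PA.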
@nv-braf
Contributor

nv-braf commented Aug 23, 2023

Can you post the full command you are using to run MA?
Can you post the tritonserver output log?
Have you confirmed that you can load and successfully run the model on a tritonserver?

@wxthu
Author

wxthu commented Aug 24, 2023

My command for running MA is as follows:
model-analyzer profile -f config2.yml --triton-launch-mode=docker --triton-docker-shm-size=4g --output-model-repository-path /home/llc/triton-multi-model-profile/results --export-path profile_results

The output log from the Triton server:

[Model Analyzer] Initializing GPUDevice handles
[Model Analyzer] Using GPU 0 NVIDIA GeForce RTX 4090 with UUID GPU-c81ce6eb-cb1a-d91e-4d7f-0aa4bcdd0ef7
[Model Analyzer] WARNING: Overriding the output model repo path "/home/llc/triton-multi-model-profile/results"
[Model Analyzer] Starting a Triton Server using docker
[Model Analyzer] No checkpoint file found, starting a fresh run.
[Model Analyzer] Profiling server only metrics...
[Model Analyzer] 
[Model Analyzer] Starting quick mode search to find optimal configs
[Model Analyzer] 
[Model Analyzer] 
[Model Analyzer] Creating model config: resnext_torch_config_default
[Model Analyzer] 
[Model Analyzer] 
[Model Analyzer] Creating model config: resnext_torch2_config_default
[Model Analyzer] 
[Model Analyzer] Profiling resnext_torch_config_default: client batch size=1, concurrency=1
[Model Analyzer] Profiling resnext_torch2_config_default: client batch size=1, concurrency=1
[Model Analyzer] 

It then hung for a very long time and finally produced the error above...

@nv-braf
Contributor

nv-braf commented Aug 24, 2023

Can you post the contents of the config2.yml file?
Can you also post/generate the triton server output log (you can do this by using the --triton-output-path option)?
Have you confirmed that you can load and successfully run the model on a tritonserver?

@wxthu
Author

wxthu commented Aug 24, 2023

I have confirmed that I can load and successfully run the model on a tritonserver,
and my config file is as follows:

model_repository: "/home/llc/multi-model-triton/examples/quick-start"
run_config_profile_models_concurrently_enable: true
override_output_model_repository: true
client_protocol: "http"
run_config_search_mode: quick
run_config_search_max_instance_count: 1
run_config_search_max_concurrency: 1
run_config_search_min_model_batch_size: 5
run_config_search_max_model_batch_size: 5
num_configs_per_model: 4

profile_models:
  resnext_torch:
    constraints:
      perf_latency_p99:
        max: 20
    model_config_parameters:
      instance_group:
        - kind: KIND_GPU
          count: 1
  resnext_torch2:
    constraints:
      perf_latency_p99:
        max: 20
    model_config_parameters:
      instance_group:
        - kind: KIND_GPU
          count: 1

and the Triton-server-generated log is in the attachment

@nv-braf
Contributor

nv-braf commented Aug 24, 2023

Thank you. @matthewkotila this looks like an issue in Perf Analyzer. I don't see anything wrong with MA or the tritonserver.

@matthewkotila
Contributor

@wxthu @nv-braf can either of y'all do me a favor and give me minimally reproducible commands that only run PA? This will help my debugging a bunch.

@nv-braf
Contributor

nv-braf commented Aug 24, 2023

Is this what you're looking for?

Command:
mpiexec --allow-run-as-root --tag-output -n 1 perf_analyzer --enable-mpi -m resnext_torch_config_default -b 1 -u localhost:8001 -i grpc -f resnext_torch_config_default-results.csv --verbose-csv --concurrency-range 1 --measurement-mode count_windows --collect-metrics --metrics-url http://localhost:8002/metrics --metrics-interval 1000.0 : -n 1 perf_analyzer --enable-mpi -m resnext_torch2_config_default -b 1 -u localhost:8001 -i grpc -f resnext_torch2_config_default-results.csv --verbose-csv --concurrency-range 1 --measurement-mode count_windows --collect-metrics --metrics-url http://localhost:8002/metrics --metrics-interval 1000.0

Error: perf_analyzer did not produce any output. It was likely terminated with a SIGABRT
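A smaller repro, stripped of mpiexec and the metrics flags, might help isolate whether the multi-process launch is the trigger. This is only a sketch: the flags are taken from the command above, and the model name is the one used in this thread.

```shell
# Single-process perf_analyzer run against one model (no MPI, no metrics collection).
perf_analyzer -m resnext_torch_config_default -b 1 \
  -u localhost:8001 -i grpc \
  --concurrency-range 1 --measurement-mode count_windows
```

If this succeeds while the mpiexec form aborts, the problem is more likely in the MPI/multi-model launch path than in PA's measurement loop.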

@matthewkotila
Contributor

matthewkotila commented Aug 24, 2023

Ooh gotcha, missed that. Thanks!

We're tracking this but we do not currently have an estimate of when we will complete debugging the issue.

@anshudaur

Facing the same issue: Error: perf_analyzer did not produce any output. It was likely terminated with a SIGABRT

@dyastremsky dyastremsky changed the title MA do not work Unexpected error: perf_analyzer did not produce any output. It was likely terminated with a SIGABRT Oct 19, 2023
@dyastremsky dyastremsky added the bug Something isn't working label Oct 19, 2023
@riyajatar37003

Any update on this?
perf_analyzer took very long to exit, killing perf_analyzer
[Model Analyzer] perf_analyzer did not produce any output.

@riyajatar37003

riyajatar37003 commented May 15, 2024

[Model Analyzer] Initializing GPUDevice handles
[Model Analyzer] Using GPU 0 NVIDIA A100-SXM4-40GB with UUID GPU-d9a0447f-f8fa-9d2f-79fc-ecf2567dacc2
[Model Analyzer] WARNING: Overriding the output model repo path "./rerenker_output1"
[Model Analyzer] Starting a local Triton Server
[Model Analyzer] Loaded checkpoint from file /model_repositories/checkpoints/0.ckpt
[Model Analyzer] GPU devices match checkpoint - skipping server metric acquisition
[Model Analyzer]
[Model Analyzer] Starting quick mode search to find optimal configs
[Model Analyzer]
[Model Analyzer] Creating model config: reranker_config_default
[Model Analyzer]
[Model Analyzer] Creating model config: bge_reranker_v2_onnx_config_default
[Model Analyzer]
[Model Analyzer] Profiling reranker_config_default: client batch size=1, concurrency=24
[Model Analyzer] Profiling bge_reranker_v2_onnx_config_default: client batch size=1, concurrency=8
[Model Analyzer]
[Model Analyzer] perf_analyzer took very long to exit, killing perf_analyzer
[Model Analyzer] perf_analyzer did not produce any output.
[Model Analyzer] Saved checkpoint to model_repositories/checkpoints/1.ckpt
[Model Analyzer] Creating model config: reranker_config_0
[Model Analyzer] Setting instance_group to [{'count': 1, 'kind': 'KIND_GPU'}]
[Model Analyzer] Setting max_batch_size to 1
[Model Analyzer] Enabling dynamic_batching
[Model Analyzer]
[Model Analyzer] Creating model config: bge_reranker_v2_onnx_config_0
[Model Analyzer] Setting instance_group to [{'count': 1, 'kind': 'KIND_GPU'}]
[Model Analyzer] Setting max_batch_size to 1
[Model Analyzer] Enabling dynamic_batching
[Model Analyzer]
[Model Analyzer] Profiling reranker_config_0: client batch size=1, concurrency=2
[Model Analyzer] Profiling bge_reranker_v2_onnx_config_0: client batch size=1, concurrency=2
[Model Analyzer]
[Model Analyzer] perf_analyzer took very long to exit, killing perf_analyzer
[Model Analyzer] perf_analyzer did not produce any output.
[Model Analyzer] No changes made to analyzer data, no checkpoint saved.
Traceback (most recent call last):
  File "/opt/app_venv/bin/model-analyzer", line 8, in <module>
    sys.exit(main())
  File "/opt/app_venv/lib/python3.10/site-packages/model_analyzer/entrypoint.py", line 278, in main
    analyzer.profile(
  File "/opt/app_venv/lib/python3.10/site-packages/model_analyzer/analyzer.py", line 124, in profile
    self._profile_models()
  File "/opt/app_venv/lib/python3.10/site-packages/model_analyzer/analyzer.py", line 233, in _profile_models
    self._model_manager.run_models(models=models)
  File "/opt/app_venv/lib/python3.10/site-packages/model_analyzer/model_manager.py", line 145, in run_models
    self._stop_ma_if_no_valid_measurement_threshold_reached()
  File "/opt/app_venv/lib/python3.10/site-packages/model_analyzer/model_manager.py", line 239, in _stop_ma_if_no_valid_measurement_threshold_reached
    raise TritonModelAnalyzerException(
model_analyzer.model_analyzer_exceptions.TritonModelAnalyzerException: The first 2 attempts to acquire measurements have failed. Please examine the Tritonserver/PA error logs to determine what has gone wrong.

@riyajatar37003

riyajatar37003 commented May 15, 2024

The perf_analyzer error log:

Command:
mpiexec --allow-run-as-root --tag-output -n 1 perf_analyzer --enable-mpi -m reranker -b 1 -u localhost:8001 -i grpc -f results.csv --verbose-csv --concurrency-range 24 --measurement-mode count_windows --collect-metrics --metrics-url http://localhost:8002/metrics --metrics-interval 1000 : -n 1 perf_analyzer --enable-mpi -m onnx -b 1 -u localhost:8001 -i grpc -f results.csv --verbose-csv --concurrency-range 8 --measurement-mode count_windows --collect-metrics --metrics-url http://localhost:8002/metrics --metrics-interval 1000

Error: perf_analyzer did not produce any output. It was likely terminated with a SIGABRT.
