Running the model on vLLM fails: `No available memory for the cache blocks` & `Error executing method 'determine_num_available_blocks'`

#8
by surajd - opened

Depending on the `--max-model-len` value, I see the following errors:

With a value of 1024, I get the following error:

ValueError: No available memory for the cache blocks. Try increasing `gpu_memory_utilization` when initializing the engine.

With a value of 8192, I get the following error:

Error executing method 'determine_num_available_blocks'. This might cause deadlock in distributed execution.

I have deployed the model on two Azure machines with A100 GPUs (SKU Standard_ND96asr_v4). Each machine has 8 GPUs with 40 GB of memory each. The deployment runs on Kubernetes, and the cluster is set up to use the InfiniBand network.
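
For reference, here is a rough back-of-envelope of the per-GPU memory budget for this layout. This is only a sketch, not vLLM's code; the 39.49 GiB, 25.39 GiB, and other figures are taken from the profiler lines in the logs below:

# Sketch of the per-GPU budget; figures come from the vLLM profiler output below.
tensor_parallel_size = 8                  # GPUs per node
pipeline_parallel_size = 2                # one pipeline stage per node
total_gpus = tensor_parallel_size * pipeline_parallel_size   # 16

usable_gib = 39.49 * 0.95                 # --gpu-memory-utilization 0.95 -> ~37.52 GiB
weights_gib = 25.39                       # per-GPU weight shard (from the profiler)
non_torch_gib = 0.93
activation_peak_gib = 0.35                # profiler peak at --max-model-len 1024
kv_budget_gib = usable_gib - weights_gib - non_torch_gib - activation_peak_gib
print(total_gpus, round(kv_budget_gib, 2))  # 16, ~10.85 GiB left for the KV cache

So on paper roughly 10-11 GiB per GPU should be left for the KV cache at `--max-model-len 1024`.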

The full set of logs is available in the subsequent comments of this gist: https://gist.github.com/surajssd/a89668b2cdfee36802b0a31c6a224bdc

Here is the stack trace of the error with `--max-model-len 1024`:

$ export NCCL_DEBUG=INFO
$ export NCCL_NET_GDR_LEVEL=SYS
$ export NCCL_IB_DISABLE="0"
$ python3 -m vllm.entrypoints.openai.api_server \
    --port 8000 \
    --model nvidia/Llama-3_1-Nemotron-Ultra-253B-v1 \
    --tensor-parallel-size 8 \
    --pipeline-parallel-size 2 \
    --trust-remote-code \
    --seed 1 \
    --max-model-len 1024 \
    --gpu-memory-utilization 0.95 \
    --enforce-eager
INFO 04-17 15:10:34 [__init__.py:239] Automatically detected platform cuda.
INFO 04-17 15:10:35 [api_server.py:1034] vLLM API server version 0.8.3
INFO 04-17 15:10:35 [api_server.py:1035] args: Namespace(host=None, port=8000, uvicorn_log_level='info', disable_uvicorn_access_log=False, allow_credentials=False, allowed_origins=['*'], allowed_methods=['*'], allowed_headers=['*'], api_key=None, lora_modules=None, prompt_adapters=None, chat_template=None, chat_template_content_format='auto', response_role='assistant', ssl_keyfile=None, ssl_certfile=None, ssl_ca_certs=None, enable_ssl_refresh=False, ssl_cert_reqs=0, root_path=None, middleware=[], return_tokens_as_token_ids=False, disable_frontend_multiprocessing=False, enable_request_id_headers=False, enable_auto_tool_choice=False, tool_call_parser=None, tool_parser_plugin='', model='nvidia/Llama-3_1-Nemotron-Ultra-253B-v1', task='auto', tokenizer=None, hf_config_path=None, skip_tokenizer_init=False, revision=None, code_revision=None, tokenizer_revision=None, tokenizer_mode='auto', trust_remote_code=True, allowed_local_media_path=None, download_dir=None, load_format='auto', config_format=<ConfigFormat.AUTO: 'auto'>, dtype='auto', kv_cache_dtype='auto', max_model_len=1024, guided_decoding_backend='xgrammar', logits_processor_pattern=None, model_impl='auto', distributed_executor_backend=None, pipeline_parallel_size=2, tensor_parallel_size=8, data_parallel_size=1, enable_expert_parallel=False, max_parallel_loading_workers=None, ray_workers_use_nsight=False, block_size=None, enable_prefix_caching=None, prefix_caching_hash_algo='builtin', disable_sliding_window=False, use_v2_block_manager=True, num_lookahead_slots=0, seed=1, swap_space=4, cpu_offload_gb=0, gpu_memory_utilization=0.95, num_gpu_blocks_override=None, max_num_batched_tokens=None, max_num_partial_prefills=1, max_long_partial_prefills=1, long_prefill_token_threshold=0, max_num_seqs=None, max_logprobs=20, disable_log_stats=False, quantization=None, rope_scaling=None, rope_theta=None, hf_overrides=None, enforce_eager=True, max_seq_len_to_capture=8192, disable_custom_all_reduce=False, tokenizer_pool_size=0, tokenizer_pool_type='ray', tokenizer_pool_extra_config=None, limit_mm_per_prompt=None, mm_processor_kwargs=None, disable_mm_preprocessor_cache=False, enable_lora=False, enable_lora_bias=False, max_loras=1, max_lora_rank=16, lora_extra_vocab_size=256, lora_dtype='auto', long_lora_scaling_factors=None, max_cpu_loras=None, fully_sharded_loras=False, enable_prompt_adapter=False, max_prompt_adapters=1, max_prompt_adapter_token=0, device='auto', num_scheduler_steps=1, use_tqdm_on_load=True, multi_step_stream_outputs=True, scheduler_delay_factor=0.0, enable_chunked_prefill=None, speculative_config=None, model_loader_extra_config=None, ignore_patterns=[], preemption_mode=None, served_model_name=None, qlora_adapter_name_or_path=None, show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None, disable_async_output_proc=False, scheduling_policy='fcfs', scheduler_cls='vllm.core.scheduler.Scheduler', override_neuron_config=None, override_pooler_config=None, compilation_config=None, kv_transfer_config=None, worker_cls='auto', worker_extension_cls='', generation_config='auto', override_generation_config=None, enable_sleep_mode=False, calculate_kv_scales=False, additional_config=None, enable_reasoning=False, reasoning_parser=None, disable_cascade_attn=False, disable_log_requests=False, max_log_len=None, disable_fastapi_docs=False, enable_prompt_tokens_details=False, enable_server_load_tracking=False)
INFO 04-17 15:10:43 [config.py:600] This model supports multiple tasks: {'reward', 'embed', 'classify', 'score', 'generate'}. Defaulting to 'generate'.
WARNING 04-17 15:10:43 [arg_utils.py:1708] Pipeline Parallelism without Ray distributed executor is not supported by the V1 Engine. Falling back to V0.
INFO 04-17 15:10:43 [config.py:1600] Defaulting to use ray for distributed inference
INFO 04-17 15:10:43 [llm_engine.py:242] Initializing a V0 LLM engine (v0.8.3) with config: model='nvidia/Llama-3_1-Nemotron-Ultra-253B-v1', speculative_config=None, tokenizer='nvidia/Llama-3_1-Nemotron-Ultra-253B-v1', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.bfloat16, max_seq_len=1024, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=8, pipeline_parallel_size=2, disable_custom_all_reduce=False, quantization=None, enforce_eager=True, kv_cache_dtype=auto,  device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='xgrammar', reasoning_backend=None), observability_config=ObservabilityConfig(show_hidden_metrics=False, otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=1, served_model_name=nvidia/Llama-3_1-Nemotron-Ultra-253B-v1, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=None, chunked_prefill_enabled=False, use_async_output_proc=False, disable_mm_preprocessor_cache=False, mm_processor_kwargs=None, pooler_config=None, compilation_config={"splitting_ops":[],"compile_sizes":[],"cudagraph_capture_sizes":[],"max_capture_size":0}, use_cached_outputs=False,
...
[SNIP]
...
(RayWorkerWrapper pid=37396, ip=10.244.1.95) INFO 04-17 15:11:16 [model_runner.py:1146] Model loading took 33.6914 GiB and 17.547453 seconds [repeated 14x across cluster]
INFO 04-17 15:11:18 [worker.py:267] Memory profiling takes 1.96 seconds
INFO 04-17 15:11:18 [worker.py:267] the current vLLM instance can use total_gpu_memory (39.49GiB) x gpu_memory_utilization (0.95) = 37.52GiB
INFO 04-17 15:11:18 [worker.py:267] model weights take 25.39GiB; non_torch_memory takes 0.93GiB; PyTorch activation peak memory takes 0.35GiB; the rest of the memory reserved for KV Cache is 10.85GiB.
INFO 04-17 15:11:18 [executor_base.py:112] # cuda blocks: 0, # CPU blocks: 12787
INFO 04-17 15:11:18 [executor_base.py:117] Maximum concurrency for 1024 tokens per request: 0.00x
ERROR 04-17 15:11:18 [worker_base.py:620] Error executing method 'initialize_cache'. This might cause deadlock in distributed execution.
ERROR 04-17 15:11:18 [worker_base.py:620] Traceback (most recent call last):
ERROR 04-17 15:11:18 [worker_base.py:620]   File "/usr/local/lib/python3.12/dist-packages/vllm/worker/worker_base.py", line 612, in execute_method
ERROR 04-17 15:11:18 [worker_base.py:620]     return run_method(self, method, args, kwargs)
ERROR 04-17 15:11:18 [worker_base.py:620]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-17 15:11:18 [worker_base.py:620]   File "/usr/local/lib/python3.12/dist-packages/vllm/utils.py", line 2347, in run_method
ERROR 04-17 15:11:18 [worker_base.py:620]     return func(*args, **kwargs)
ERROR 04-17 15:11:18 [worker_base.py:620]            ^^^^^^^^^^^^^^^^^^^^^
ERROR 04-17 15:11:18 [worker_base.py:620]   File "/usr/local/lib/python3.12/dist-packages/vllm/worker/worker.py", line 291, in initialize_cache
ERROR 04-17 15:11:18 [worker_base.py:620]     raise_if_cache_size_invalid(
ERROR 04-17 15:11:18 [worker_base.py:620]   File "/usr/local/lib/python3.12/dist-packages/vllm/worker/worker.py", line 540, in raise_if_cache_size_invalid
ERROR 04-17 15:11:18 [worker_base.py:620]     raise ValueError("No available memory for the cache blocks. "
ERROR 04-17 15:11:18 [worker_base.py:620] ValueError: No available memory for the cache blocks. Try increasing `gpu_memory_utilization` when initializing the engine.
1-nemotron-ultra-253b-instruct-0:39422:39422 [0] NCCL INFO ncclCommInitRank comm 0x14533360 rank 0 nranks 8 cudaDev 0 nvmlDev 0 busId 100000 commId 0xfaf39b9976212896 - Init COMPLETE
llama-3-1-nemotron-ultra-253b-instruct-0:39422:39422 [0] NCCL INFO Using non-device net plugin version 0
llama-3-1-nemotron-ultra-253b-instruct-0:39422:39422 [0] NCCL INFO Using network IB
llama-3-1-nemotron-ultra-253b-instruct-0:39422:39422 [0] NCCL INFO DMA-BUF is available on GPU device 0
llama-3-1-nemotron-ultra-253b-instruct-0:39422:39422 [0] NCCL INFO ncclCommInitRank comm 0x176c38b0 rank 0 nranks 2 cudaDev 0 nvmlDev 0 busId 100000 commId 0x6226897a1a4c6a84 - Init START
llama-3-1-nemotron-ultra-253b-instruct-0:39422:39422 [0] NCCL INFO Setting affinity for GPU 0 to ffff,ff000000
llama-3-1-nemotron-ultra-253b-instruct-0:39422:39422 [0] NCCL INFO comm 0x176c38b0 rank 0 nRanks 2 nNodes 2 localRanks 1 localRank 0 MNNVL 0
llama-3-1-nemotron-ultra-253b-instruct-0:39422:39422 [0] NCCL INFO Channel 00/02 :    0   1
llama-3-1-nemotron-ultra-253b-instruct-0:39422:39422 [0] NCCL INFO Channel 01/02 :    0   1
llama-3-1-nemotron-ultra-253b-instruct-0:39422:39422 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1 [1] -1/-1/-1->0->1
llama-3-1-nemotron-ultra-253b-instruct-0:39422:39422 [0] NCCL INFO P2P Chunksize set to 131072
llama-3-1-nemotron-ultra-253b-instruct-0:39422:39422 [0] NCCL INFO NCCL_NET_GDR_LEVEL set by environment to SYS
llama-3-1-nemotron-ultra-253b-instruct-0:39422:39422 [0] NCCL INFO Channel 00/0 : 1[0] -> 0[0] [receive] via NET/IB/0/GDRDMA
llama-3-1-nemotron-ultra-253b-instruct-0:39422:39422 [0] NCCL INFO Channel 01/0 : 1[0] -> 0[0] [receive] via NET/IB/0/GDRDMA
llama-3-1-nemotron-ultra-253b-instruct-0:39422:39422 [0] NCCL INFO Channel 00/0 : 0[0] -> 1[0] [send] via NET/IB/0/GDRDMA
llama-3-1-nemotron-ultra-253b-instruct-0:39422:39422 [0] NCCL INFO Channel 01/0 : 0[0] -> 1[0] [send] via NET/IB/0/GDRDMA
llama-3-1-nemotron-ultra-253b-instruct-0:39422:39422 [0] NCCL INFO Connected all rings
llama-3-1-nemotron-ultra-253b-instruct-0:39422:39422 [0] NCCL INFO Connected all trees
llama-3-1-nemotron-ultra-253b-instruct-0:39422:39422 [0] NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 512 | 512
llama-3-1-nemotron-ultra-253b-instruct-0:39422:39422 [0] NCCL INFO 2 coll channels, 2 collnet channels, 0 nvls channels, 2 p2p channels, 2 p2p channels per peer
llama-3-1-nemotron-ultra-253b-instruct-0:39422:39422 [0] NCCL INFO ncclCommInitRank comm 0x176c38b0 rank 0 nranks 2 cudaDev 0 nvmlDev 0 busId 100000 commId 0x6226897a1a4c6a84 - Init COMPLETE
[rank0]: Traceback (most recent call last):
[rank0]:   File "<frozen runpy>", line 198, in _run_module_as_main
[rank0]:   File "<frozen runpy>", line 88, in _run_code
[rank0]:   File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 1121, in <module>
[rank0]:     uvloop.run(run_server(args))
[rank0]:   File "/usr/local/lib/python3.12/dist-packages/uvloop/__init__.py", line 109, in run
[rank0]:     return __asyncio.run(
[rank0]:            ^^^^^^^^^^^^^^
[rank0]:   File "/usr/lib/python3.12/asyncio/runners.py", line 195, in run
[rank0]:     return runner.run(main)
[rank0]:            ^^^^^^^^^^^^^^^^
[rank0]:   File "/usr/lib/python3.12/asyncio/runners.py", line 118, in run
[rank0]:     return self._loop.run_until_complete(task)
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "uvloop/loop.pyx", line 1518, in uvloop.loop.Loop.run_until_complete
[rank0]:   File "/usr/local/lib/python3.12/dist-packages/uvloop/__init__.py", line 61, in wrapper
[rank0]:     return await main
[rank0]:            ^^^^^^^^^^
[rank0]:   File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 1069, in run_server
[rank0]:     async with build_async_engine_client(args) as engine_client:
[rank0]:                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/usr/lib/python3.12/contextlib.py", line 210, in __aenter__
[rank0]:     return await anext(self.gen)
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 146, in build_async_engine_client
[rank0]:     async with build_async_engine_client_from_engine_args(
[rank0]:                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/usr/lib/python3.12/contextlib.py", line 210, in __aenter__
[rank0]:     return await anext(self.gen)
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 194, in build_async_engine_client_from_engine_args
[rank0]:     engine_client = AsyncLLMEngine.from_vllm_config(
[rank0]:                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/usr/local/lib/python3.12/dist-packages/vllm/engine/async_llm_engine.py", line 653, in from_vllm_config
[rank0]:     return cls(
[rank0]:            ^^^^
[rank0]:   File "/usr/local/lib/python3.12/dist-packages/vllm/engine/async_llm_engine.py", line 608, in __init__
[rank0]:     self.engine = self._engine_class(*args, **kwargs)
[rank0]:                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/usr/local/lib/python3.12/dist-packages/vllm/engine/async_llm_engine.py", line 267, in __init__
[rank0]:     super().__init__(*args, **kwargs)
[rank0]:   File "/usr/local/lib/python3.12/dist-packages/vllm/engine/llm_engine.py", line 284, in __init__
[rank0]:     self._initialize_kv_caches()
[rank0]:   File "/usr/local/lib/python3.12/dist-packages/vllm/engine/llm_engine.py", line 446, in _initialize_kv_caches
[rank0]:     self.model_executor.initialize_cache(num_gpu_blocks, num_cpu_blocks)
[rank0]:   File "/usr/local/lib/python3.12/dist-packages/vllm/executor/executor_base.py", line 123, in initialize_cache
[rank0]:     self.collective_rpc("initialize_cache",
[rank0]:   File "/usr/local/lib/python3.12/dist-packages/vllm/executor/executor_base.py", line 331, in collective_rpc
[rank0]:     return self._run_workers(method, *args, **(kwargs or {}))
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/usr/local/lib/python3.12/dist-packages/vllm/executor/ray_distributed_executor.py", line 516, in _run_workers
[rank0]:     self.driver_worker.execute_method(sent_method, *args, **kwargs)
[rank0]:   File "/usr/local/lib/python3.12/dist-packages/vllm/worker/worker_base.py", line 621, in execute_method
[rank0]:     raise e
[rank0]:   File "/usr/local/lib/python3.12/dist-packages/vllm/worker/worker_base.py", line 612, in execute_method
[rank0]:     return run_method(self, method, args, kwargs)
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/usr/local/lib/python3.12/dist-packages/vllm/utils.py", line 2347, in run_method
[rank0]:     return func(*args, **kwargs)
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/usr/local/lib/python3.12/dist-packages/vllm/worker/worker.py", line 291, in initialize_cache
[rank0]:     raise_if_cache_size_invalid(
[rank0]:   File "/usr/local/lib/python3.12/dist-packages/vllm/worker/worker.py", line 540, in raise_if_cache_size_invalid
[rank0]:     raise ValueError("No available memory for the cache blocks. "
[rank0]: ValueError: No available memory for the cache blocks. Try increasing `gpu_memory_utilization` when initializing the engine.
(RayWorkerWrapper pid=37396, ip=10.244.1.95) ERROR 04-17 15:11:18 [worker_base.py:620] Error executing method 'initialize_cache'. This might cause deadlock in distributed execution.
(RayWorkerWrapper pid=37396, ip=10.244.1.95) ERROR 04-17 15:11:18 [worker_base.py:620] Traceback (most recent call last):
(RayWorkerWrapper pid=37396, ip=10.244.1.95) ERROR 04-17 15:11:18 [worker_base.py:620]   File "/usr/local/lib/python3.12/dist-packages/vllm/worker/worker_base.py", line 612, in execute_method
(RayWorkerWrapper pid=37396, ip=10.244.1.95) ERROR 04-17 15:11:18 [worker_base.py:620]     return run_method(self, method, args, kwargs)
(RayWorkerWrapper pid=37396, ip=10.244.1.95) ERROR 04-17 15:11:18 [worker_base.py:620]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(RayWorkerWrapper pid=37396, ip=10.244.1.95) ERROR 04-17 15:11:18 [worker_base.py:620]   File "/usr/local/lib/python3.12/dist-packages/vllm/utils.py", line 2347, in run_method
(RayWorkerWrapper pid=37396, ip=10.244.1.95) ERROR 04-17 15:11:18 [worker_base.py:620]     return func(*args, **kwargs)
(RayWorkerWrapper pid=37396, ip=10.244.1.95) ERROR 04-17 15:11:18 [worker_base.py:620]            ^^^^^^^^^^^^^^^^^^^^^
(RayWorkerWrapper pid=37396, ip=10.244.1.95) ERROR 04-17 15:11:18 [worker_base.py:620]   File "/usr/local/lib/python3.12/dist-packages/vllm/worker/worker.py", line 291, in initialize_cache
(RayWorkerWrapper pid=37396, ip=10.244.1.95) ERROR 04-17 15:11:18 [worker_base.py:620]     raise_if_cache_size_invalid(
(RayWorkerWrapper pid=37396, ip=10.244.1.95) ERROR 04-17 15:11:18 [worker_base.py:620]   File "/usr/local/lib/python3.12/dist-packages/vllm/worker/worker.py", line 540, in raise_if_cache_size_invalid
(RayWorkerWrapper pid=37396, ip=10.244.1.95) ERROR 04-17 15:11:18 [worker_base.py:620]     raise ValueError("No available memory for the cache blocks. "
(RayWorkerWrapper pid=37396, ip=10.244.1.95) ERROR 04-17 15:11:18 [worker_base.py:620] ValueError: No available memory for the cache blocks. Try increasing `gpu_memory_utilization` when initializing the engine.
(RayWorkerWrapper pid=37398, ip=10.244.1.95) llama-3-1-nemotron-ultra-253b-instruct-0-1:37398:37398 [7] NCCL INFO NCCL_NET_GDR_LEVEL set by environment to SYS [repeated 7x across cluster]
(RayWorkerWrapper pid=37398, ip=10.244.1.95) llama-3-1-nemotron-ultra-253b-instruct-0-1:37398:37398 [7] NCCL INFO Channel 01/0 : 0[7] -> 1[7] [receive] via NET/IB/7/GDRDMA [repeated 14x across cluster]
(RayWorkerWrapper pid=37398, ip=10.244.1.95) llama-3-1-nemotron-ultra-253b-instruct-0-1:37398:37398 [7] NCCL INFO Channel 01/0 : 1[7] -> 0[7] [send] via NET/IB/7/GDRDMA [repeated 14x across cluster]
(RayWorkerWrapper pid=37398, ip=10.244.1.95) llama-3-1-nemotron-ultra-253b-instruct-0-1:37398:37398 [7] NCCL INFO Comm config Blocking set to 1 [repeated 7x across cluster]
(RayWorkerWrapper pid=37394, ip=10.244.1.95) NFO Connected all trees
(RayWorkerWrapper pid=37393, ip=10.244.1.95) llama-3-1-nemotron-ultra-253b-instruct-0-1:37393:38461 [2] NCCL INFO Cha
(RayWorkerWrapper pid=39763) INFO 04-17 15:11:18 [worker.py:267] Memory profiling takes 1.97 seconds [repeated 14x across cluster]
(RayWorkerWrapper pid=39763) INFO 04-17 15:11:18 [worker.py:267] the current vLLM instance can use total_gpu_memory (39.49GiB) x gpu_memory_utilization (0.95) = 37.52GiB [repeated 14x across cluster]
(RayWorkerWrapper pid=39763) INFO 04-17 15:11:18 [worker.py:267] model weights take 25.39GiB; non_torch_memory takes 0.93GiB; PyTorch activation peak memory takes 0.35GiB; the rest of the memory reserved for KV Cache is 10.85GiB. [repeated 14x across cluster]
(RayWorkerWrapper pid=39763) ERROR 04-17 15:11:18 [worker_base.py:620] Error executing method 'initialize_cache'. This might cause deadlock in distributed execution. [repeated 14x across cluster]
(RayWorkerWrapper pid=39763) ERROR 04-17 15:11:18 [worker_base.py:620] Traceback (most recent call last): [repeated 14x across cluster]
(RayWorkerWrapper pid=39763) ERROR 04-17 15:11:18 [worker_base.py:620]   File "/usr/local/lib/python3.12/dist-packages/vllm/worker/worker_base.py", line 612, in execute_method [repeated 14x across cluster]
(RayWorkerWrapper pid=39763) ERROR 04-17 15:11:18 [worker_base.py:620]     return run_method(self, method, args, kwargs) [repeated 14x across cluster]
(RayWorkerWrapper pid=39763) ERROR 04-17 15:11:18 [worker_base.py:620]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ [repeated 14x across cluster]
(RayWorkerWrapper pid=39763) ERROR 04-17 15:11:18 [worker_base.py:620]   File "/usr/local/lib/python3.12/dist-packages/vllm/utils.py", line 2347, in run_method [repeated 14x across cluster]
(RayWorkerWrapper pid=39763) ERROR 04-17 15:11:18 [worker_base.py:620]     return func(*args, **kwargs) [repeated 14x across cluster]
(RayWorkerWrapper pid=39763) ERROR 04-17 15:11:18 [worker_base.py:620]            ^^^^^^^^^^^^^^^^^^^^^ [repeated 14x across cluster]
(RayWorkerWrapper pid=39763) ERROR 04-17 15:11:18 [worker_base.py:620]   File "/usr/local/lib/python3.12/dist-packages/vllm/worker/worker.py", line 291, in initialize_cache [repeated 14x across cluster]
(RayWorkerWrapper pid=39763) ERROR 04-17 15:11:18 [worker_base.py:620]     raise_if_cache_size_invalid( [repeated 14x across cluster]
(RayWorkerWrapper pid=39763) ERROR 04-17 15:11:18 [worker_base.py:620]   File "/usr/local/lib/python3.12/dist-packages/vllm/worker/worker.py", line 540, in raise_if_cache_size_invalid [repeated 14x across cluster]
(RayWorkerWrapper pid=39763) ERROR 04-17 15:11:18 [worker_base.py:620]     raise ValueError("No available memory for the cache blocks. " [repeated 14x across cluster]
(RayWorkerWrapper pid=39763) ERROR 04-17 15:11:18 [worker_base.py:620] ValueError: No available memory for the cache blocks. Try increasing `gpu_memory_utilization` when initializing the engine. [repeated 14x across cluster]
INFO 04-17 15:11:19 [ray_distributed_executor.py:127] Shutting down Ray distributed executor. If you see error log from logging.cc regarding SIGTERM received, please ignore because this is the expected termination process in Ray.
INFO 04-17 15:11:19 [ray_distributed_executor.py:127] Shutting down Ray distributed executor. If you see error log from logging.cc regarding SIGTERM received, please ignore because this is the expected termination process in Ray.
[rank0]:[W417 15:11:19.855155568 ProcessGroupNCCL.cpp:1496] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. For more info, please see https://pytorch.org/docs/stable/distributed.html#shutdown (function operator())
/usr/lib/python3.12/multiprocessing/resource_tracker.py:255: UserWarning: resource_tracker: There appear to be 1 leaked shared_memory objects to clean up at shutdown
  warnings.warn('resource_tracker: There appear to be %d '
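
What puzzles me in the trace above is that the profiler reports 10.85 GiB left for the KV cache on this rank, yet the engine still ends up with `# cuda blocks: 0`. As far as I can tell, vLLM takes the minimum block count across all workers, so a single rank with no spare KV memory is enough to zero it out. Roughly, the per-rank count comes out of arithmetic like the sketch below; this is not vLLM's actual implementation, and the layer/head numbers are placeholders rather than this model's real NAS layout:

# Hedged sketch of how a per-rank GPU block count could be derived.
def estimate_gpu_blocks(kv_budget_gib, num_layers, num_kv_heads_per_rank,
                        head_dim, block_size=16, dtype_bytes=2):
    # Bytes of KV cache per token on this rank: K and V, across local layers/heads.
    per_token_bytes = 2 * num_layers * num_kv_heads_per_rank * head_dim * dtype_bytes
    return int(kv_budget_gib * 1024**3 // (per_token_bytes * block_size))

# Placeholder values for illustration only.
print(estimate_gpu_blocks(kv_budget_gib=10.85, num_layers=80,
                          num_kv_heads_per_rank=1, head_dim=128))

If the same arithmetic on any other rank rounds down to zero blocks, the whole engine reports 0 CUDA blocks, which is what the ValueError above complains about.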

Here is the stack trace of the error with `--max-model-len 8192`:

$ export NCCL_DEBUG=INFO
$ export NCCL_NET_GDR_LEVEL=SYS
$ export NCCL_IB_DISABLE="0"
$ python3 -m vllm.entrypoints.openai.api_server \
    --port 8000 \
    --model nvidia/Llama-3_1-Nemotron-Ultra-253B-v1 \
    --tensor-parallel-size 8 \
    --pipeline-parallel-size 2 \
    --trust-remote-code \
    --seed 1 \
    --max-model-len 8192 \
    --gpu-memory-utilization 0.95 \
    --enforce-eager
INFO 04-17 15:11:46 [__init__.py:239] Automatically detected platform cuda.
INFO 04-17 15:11:47 [api_server.py:1034] vLLM API server version 0.8.3
INFO 04-17 15:11:47 [api_server.py:1035] args: Namespace(host=None, port=8000, uvicorn_log_level='info', disable_uvicorn_access_log=False, allow_credentials=False, allowed_origins=['*'], allowed_methods=['*'], allowed_headers=['*'], api_key=None, lora_modules=None, prompt_adapters=None, chat_template=None, chat_template_content_format='auto', response_role='assistant', ssl_keyfile=None, ssl_certfile=None, ssl_ca_certs=None, enable_ssl_refresh=False, ssl_cert_reqs=0, root_path=None, middleware=[], return_tokens_as_token_ids=False, disable_frontend_multiprocessing=False, enable_request_id_headers=False, enable_auto_tool_choice=False, tool_call_parser=None, tool_parser_plugin='', model='nvidia/Llama-3_1-Nemotron-Ultra-253B-v1', task='auto', tokenizer=None, hf_config_path=None, skip_tokenizer_init=False, revision=None, code_revision=None, tokenizer_revision=None, tokenizer_mode='auto', trust_remote_code=True, allowed_local_media_path=None, download_dir=None, load_format='auto', config_format=<ConfigFormat.AUTO: 'auto'>, dtype='auto', kv_cache_dtype='auto', max_model_len=8192, guided_decoding_backend='xgrammar', logits_processor_pattern=None, model_impl='auto', distributed_executor_backend=None, pipeline_parallel_size=2, tensor_parallel_size=8, data_parallel_size=1, enable_expert_parallel=False, max_parallel_loading_workers=None, ray_workers_use_nsight=False, block_size=None, enable_prefix_caching=None, prefix_caching_hash_algo='builtin', disable_sliding_window=False, use_v2_block_manager=True, num_lookahead_slots=0, seed=1, swap_space=4, cpu_offload_gb=0, gpu_memory_utilization=0.95, num_gpu_blocks_override=None, max_num_batched_tokens=None, max_num_partial_prefills=1, max_long_partial_prefills=1, long_prefill_token_threshold=0, max_num_seqs=None, max_logprobs=20, disable_log_stats=False, quantization=None, rope_scaling=None, rope_theta=None, hf_overrides=None, enforce_eager=True, max_seq_len_to_capture=8192, disable_custom_all_reduce=False, tokenizer_pool_size=0, tokenizer_pool_type='ray', tokenizer_pool_extra_config=None, limit_mm_per_prompt=None, mm_processor_kwargs=None, disable_mm_preprocessor_cache=False, enable_lora=False, enable_lora_bias=False, max_loras=1, max_lora_rank=16, lora_extra_vocab_size=256, lora_dtype='auto', long_lora_scaling_factors=None, max_cpu_loras=None, fully_sharded_loras=False, enable_prompt_adapter=False, max_prompt_adapters=1, max_prompt_adapter_token=0, device='auto', num_scheduler_steps=1, use_tqdm_on_load=True, multi_step_stream_outputs=True, scheduler_delay_factor=0.0, enable_chunked_prefill=None, speculative_config=None, model_loader_extra_config=None, ignore_patterns=[], preemption_mode=None, served_model_name=None, qlora_adapter_name_or_path=None, show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None, disable_async_output_proc=False, scheduling_policy='fcfs', scheduler_cls='vllm.core.scheduler.Scheduler', override_neuron_config=None, override_pooler_config=None, compilation_config=None, kv_transfer_config=None, worker_cls='auto', worker_extension_cls='', generation_config='auto', override_generation_config=None, enable_sleep_mode=False, calculate_kv_scales=False, additional_config=None, enable_reasoning=False, reasoning_parser=None, disable_cascade_attn=False, disable_log_requests=False, max_log_len=None, disable_fastapi_docs=False, enable_prompt_tokens_details=False, enable_server_load_tracking=False)
INFO 04-17 15:11:55 [config.py:600] This model supports multiple tasks: {'embed', 'reward', 'classify', 'score', 'generate'}. Defaulting to 'generate'.
WARNING 04-17 15:11:55 [arg_utils.py:1708] Pipeline Parallelism without Ray distributed executor is not supported by the V1 Engine. Falling back to V0.
INFO 04-17 15:11:55 [config.py:1600] Defaulting to use ray for distributed inference
INFO 04-17 15:11:55 [llm_engine.py:242] Initializing a V0 LLM engine (v0.8.3) with config: model='nvidia/Llama-3_1-Nemotron-Ultra-253B-v1', speculative_config=None, tokenizer='nvidia/Llama-3_1-Nemotron-Ultra-253B-v1', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.bfloat16, max_seq_len=8192, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=8, pipeline_parallel_size=2, disable_custom_all_reduce=False, quantization=None, enforce_eager=True, kv_cache_dtype=auto,  device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='xgrammar', reasoning_backend=None), observability_config=ObservabilityConfig(show_hidden_metrics=False, otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=1, served_model_name=nvidia/Llama-3_1-Nemotron-Ultra-253B-v1, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=None, chunked_prefill_enabled=False, use_async_output_proc=False, disable_mm_preprocessor_cache=False, mm_processor_kwargs=None, pooler_config=None, compilation_config={"splitting_ops":[],"compile_sizes":[],"cudagraph_capture_sizes":[],"max_capture_size":0}, use_cached_outputs=False,
2025-04-17 15:11:56,959	INFO worker.py:1654 -- Connecting to existing Ray cluster at address: 10.244.2.159:6379...
...
[SNIP]
...
(RayWorkerWrapper pid=38532, ip=10.244.1.95) CL INFO Trees [0] 7/-1/-1->6->5 [1] 7/-1/-1->6->5 [2] 7/-1/-1->6->5 [3] 7/-1/-1->6->5 [4] 7/-1/-1->6->5 [5] 7/-1/-1->6->5 [6] 7/-1/-1->6->5 [7] 7/-1/-1->6->5 [8] 7/-1/-1->6->5 [9] 7/-1/-1->6->5 [10] 7/-1/-1->6->5 [11] 7/-1/-1->6->5 [12] 7/-1/-1->6->5 [13] 7/-1/-1->6->5 [14] 7/-1/-1->6->5 [15] 7/-1/-1->6->5 [16] 7/-1/-1->6->5 [17] 7/-1/-1->6->5 [18] 7/-1/-1->6->5 [19] 7/-1/-1->6->5 [20] 7/-1/-1->6->5 [21] 7/-1/-1->6->5 [22] 7/-1/-1->6->5 [23] 7/-1/-1->6->5
(RayWorkerWrapper pid=38532, ip=10.244.1.95) llama-3-1-nemotron-ultra-253b-
(RayWorkerWrapper pid=38530, ip=10.244.1.95) llama-3-1-ne
INFO 04-17 15:12:32 [worker.py:267] Memory profiling takes 2.73 seconds
INFO 04-17 15:12:32 [worker.py:267] the current vLLM instance can use total_gpu_memory (39.49GiB) x gpu_memory_utilization (0.95) = 37.52GiB
INFO 04-17 15:12:32 [worker.py:267] model weights take 25.39GiB; non_torch_memory takes 0.93GiB; PyTorch activation peak memory takes 1.36GiB; the rest of the memory reserved for KV Cache is 9.84GiB.
(RayWorkerWrapper pid=41196) INFO 04-17 15:12:32 [worker.py:267] Memory profiling takes 2.77 seconds
(RayWorkerWrapper pid=41196) INFO 04-17 15:12:32 [worker.py:267] the current vLLM instance can use total_gpu_memory (39.49GiB) x gpu_memory_utilization (0.95) = 37.52GiB
(RayWorkerWrapper pid=41196) INFO 04-17 15:12:32 [worker.py:267] model weights take 25.39GiB; non_torch_memory takes 1.07GiB; PyTorch activation peak memory takes 1.36GiB; the rest of the memory reserved for KV Cache is 9.70GiB.
(RayWorkerWrapper pid=38526, ip=10.244.1.95) ERROR 04-17 15:12:32 [worker_base.py:620] Error executing method 'determine_num_available_blocks'. This might cause deadlock in distributed execution.
(RayWorkerWrapper pid=38526, ip=10.244.1.95) ERROR 04-17 15:12:32 [worker_base.py:620] Traceback (most recent call last):
(RayWorkerWrapper pid=38526, ip=10.244.1.95) ERROR 04-17 15:12:32 [worker_base.py:620]   File "/usr/local/lib/python3.12/dist-packages/vllm/worker/worker_base.py", line 612, in execute_method
(RayWorkerWrapper pid=38526, ip=10.244.1.95) ERROR 04-17 15:12:32 [worker_base.py:620]     return run_method(self, method, args, kwargs)
(RayWorkerWrapper pid=38526, ip=10.244.1.95) ERROR 04-17 15:12:32 [worker_base.py:620]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(RayWorkerWrapper pid=38526, ip=10.244.1.95) ERROR 04-17 15:12:32 [worker_base.py:620]   File "/usr/local/lib/python3.12/dist-packages/vllm/utils.py", line 2347, in run_method
(RayWorkerWrapper pid=38526, ip=10.244.1.95) ERROR 04-17 15:12:32 [worker_base.py:620]     return func(*args, **kwargs)
(RayWorkerWrapper pid=38526, ip=10.244.1.95) ERROR 04-17 15:12:32 [worker_base.py:620]            ^^^^^^^^^^^^^^^^^^^^^
(RayWorkerWrapper pid=38526, ip=10.244.1.95) ERROR 04-17 15:12:32 [worker_base.py:620]   File "/usr/local/lib/python3.12/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
(RayWorkerWrapper pid=38526, ip=10.244.1.95) ERROR 04-17 15:12:32 [worker_base.py:620]     return func(*args, **kwargs)
(RayWorkerWrapper pid=38526, ip=10.244.1.95) ERROR 04-17 15:12:32 [worker_base.py:620]            ^^^^^^^^^^^^^^^^^^^^^
(RayWorkerWrapper pid=38526, ip=10.244.1.95) ERROR 04-17 15:12:32 [worker_base.py:620]   File "/usr/local/lib/python3.12/dist-packages/vllm/worker/worker.py", line 229, in determine_num_available_blocks
(RayWorkerWrapper pid=38526, ip=10.244.1.95) ERROR 04-17 15:12:32 [worker_base.py:620]     self.model_runner.profile_run()
(RayWorkerWrapper pid=38526, ip=10.244.1.95) ERROR 04-17 15:12:32 [worker_base.py:620]   File "/usr/local/lib/python3.12/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
(RayWorkerWrapper pid=38526, ip=10.244.1.95) ERROR 04-17 15:12:32 [worker_base.py:620]     return func(*args, **kwargs)
(RayWorkerWrapper pid=38526, ip=10.244.1.95) ERROR 04-17 15:12:32 [worker_base.py:620]            ^^^^^^^^^^^^^^^^^^^^^
(RayWorkerWrapper pid=38526, ip=10.244.1.95) ERROR 04-17 15:12:32 [worker_base.py:620]   File "/usr/local/lib/python3.12/dist-packages/vllm/worker/model_runner.py", line 1243, in profile_run
(RayWorkerWrapper pid=38526, ip=10.244.1.95) ERROR 04-17 15:12:32 [worker_base.py:620]     self._dummy_run(max_num_batched_tokens, max_num_seqs)
(RayWorkerWrapper pid=38526, ip=10.244.1.95) ERROR 04-17 15:12:32 [worker_base.py:620]   File "/usr/local/lib/python3.12/dist-packages/vllm/worker/model_runner.py", line 1369, in _dummy_run
(RayWorkerWrapper pid=38526, ip=10.244.1.95) ERROR 04-17 15:12:32 [worker_base.py:620]     self.execute_model(model_input, kv_caches, intermediate_tensors)
(RayWorkerWrapper pid=38526, ip=10.244.1.95) ERROR 04-17 15:12:32 [worker_base.py:620]   File "/usr/local/lib/python3.12/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
(RayWorkerWrapper pid=38526, ip=10.244.1.95) ERROR 04-17 15:12:32 [worker_base.py:620]     return func(*args, **kwargs)
(RayWorkerWrapper pid=38526, ip=10.244.1.95) ERROR 04-17 15:12:32 [worker_base.py:620]            ^^^^^^^^^^^^^^^^^^^^^
(RayWorkerWrapper pid=38526, ip=10.244.1.95) ERROR 04-17 15:12:32 [worker_base.py:620]   File "/usr/local/lib/python3.12/dist-packages/vllm/worker/model_runner.py", line 1816, in execute_model
(RayWorkerWrapper pid=38526, ip=10.244.1.95) ERROR 04-17 15:12:32 [worker_base.py:620]     logits = self.model.compute_logits(hidden_or_intermediate_states,
(RayWorkerWrapper pid=38526, ip=10.244.1.95) ERROR 04-17 15:12:32 [worker_base.py:620]              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(RayWorkerWrapper pid=38526, ip=10.244.1.95) ERROR 04-17 15:12:32 [worker_base.py:620]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/nemotron_nas.py", line 438, in compute_logits
(RayWorkerWrapper pid=38526, ip=10.244.1.95) ERROR 04-17 15:12:32 [worker_base.py:620]     logits = self.logits_processor(self.lm_head, hidden_states,
(RayWorkerWrapper pid=38526, ip=10.244.1.95) ERROR 04-17 15:12:32 [worker_base.py:620]              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(RayWorkerWrapper pid=38526, ip=10.244.1.95) ERROR 04-17 15:12:32 [worker_base.py:620]   File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
(RayWorkerWrapper pid=38526, ip=10.244.1.95) ERROR 04-17 15:12:32 [worker_base.py:620]     return self._call_impl(*args, **kwargs)
(RayWorkerWrapper pid=38526, ip=10.244.1.95) ERROR 04-17 15:12:32 [worker_base.py:620]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(RayWorkerWrapper pid=38526, ip=10.244.1.95) ERROR 04-17 15:12:32 [worker_base.py:620]   File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1750, in _call_impl
(RayWorkerWrapper pid=38526, ip=10.244.1.95) ERROR 04-17 15:12:32 [worker_base.py:620]     return forward_call(*args, **kwargs)
(RayWorkerWrapper pid=38526, ip=10.244.1.95) ERROR 04-17 15:12:32 [worker_base.py:620]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(RayWorkerWrapper pid=38526, ip=10.244.1.95) ERROR 04-17 15:12:32 [worker_base.py:620]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/logits_processor.py", line 70, in forward
(RayWorkerWrapper pid=38526, ip=10.244.1.95) ERROR 04-17 15:12:32 [worker_base.py:620]     logits = self._get_logits(hidden_states, lm_head, embedding_bias)
(RayWorkerWrapper pid=38526, ip=10.244.1.95) ERROR 04-17 15:12:32 [worker_base.py:620]              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(RayWorkerWrapper pid=38526, ip=10.244.1.95) ERROR 04-17 15:12:32 [worker_base.py:620]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/logits_processor.py", line 113, in _get_logits
(RayWorkerWrapper pid=38526, ip=10.244.1.95) ERROR 04-17 15:12:32 [worker_base.py:620]     logits = self._gather_logits(logits)
(RayWorkerWrapper pid=38526, ip=10.244.1.95) ERROR 04-17 15:12:32 [worker_base.py:620]              ^^^^^^^^^^^^^^^^^^^^^^^^^^^
(RayWorkerWrapper pid=38526, ip=10.244.1.95) ERROR 04-17 15:12:32 [worker_base.py:620]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/logits_processor.py", line 98, in _gather_logits
(RayWorkerWrapper pid=38526, ip=10.244.1.95) ERROR 04-17 15:12:32 [worker_base.py:620]     logits = tensor_model_parallel_gather(logits)
(RayWorkerWrapper pid=38526, ip=10.244.1.95) ERROR 04-17 15:12:32 [worker_base.py:620]              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(RayWorkerWrapper pid=38526, ip=10.244.1.95) ERROR 04-17 15:12:32 [worker_base.py:620]   File "/usr/local/lib/python3.12/dist-packages/vllm/distributed/communication_op.py", line 26, in tensor_model_parallel_gather
(RayWorkerWrapper pid=38526, ip=10.244.1.95) ERROR 04-17 15:12:32 [worker_base.py:620]     return get_tp_group().gather(input_, dst, dim)1-nemotron-ultra-253b-instruct-0:40855:40855 [0] NCCL INFO ncclCommInitRank comm 0x1de541f0 rank 0 nranks 8 cudaDev 0 nvmlDev 0 busId 100000 commId 0x46120fed5721c55a - Init COMPLETE
llama-3-1-nemotron-ultra-253b-instruct-0:40855:40855 [0] NCCL INFO Using non-device net plugin version 0
llama-3-1-nemotron-ultra-253b-instruct-0:40855:40855 [0] NCCL INFO Using network IB
llama-3-1-nemotron-ultra-253b-instruct-0:40855:40855 [0] NCCL INFO DMA-BUF is available on GPU device 0
llama-3-1-nemotron-ultra-253b-instruct-0:40855:40855 [0] NCCL INFO ncclCommInitRank comm 0x20fe4520 rank 0 nranks 2 cudaDev 0 nvmlDev 0 busId 100000 commId 0x3a4c4134ba9c7a6d - Init START
llama-3-1-nemotron-ultra-253b-instruct-0:40855:40855 [0] NCCL INFO Setting affinity for GPU 0 to ffff,ff000000
llama-3-1-nemotron-ultra-253b-instruct-0:40855:40855 [0] NCCL INFO comm 0x20fe4520 rank 0 nRanks 2 nNodes 2 localRanks 1 localRank 0 MNNVL 0
llama-3-1-nemotron-ultra-253b-instruct-0:40855:40855 [0] NCCL INFO Channel 00/02 :    0   1
llama-3-1-nemotron-ultra-253b-instruct-0:40855:40855 [0] NCCL INFO Channel 01/02 :    0   1
llama-3-1-nemotron-ultra-253b-instruct-0:40855:40855 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1 [1] -1/-1/-1->0->1
llama-3-1-nemotron-ultra-253b-instruct-0:40855:40855 [0] NCCL INFO P2P Chunksize set to 131072
llama-3-1-nemotron-ultra-253b-instruct-0:40855:40855 [0] NCCL INFO NCCL_NET_GDR_LEVEL set by environment to SYS
llama-3-1-nemotron-ultra-253b-instruct-0:40855:40855 [0] NCCL INFO Channel 00/0 : 1[0] -> 0[0] [receive] via NET/IB/0/GDRDMA
llama-3-1-nemotron-ultra-253b-instruct-0:40855:40855 [0] NCCL INFO Channel 01/0 : 1[0] -> 0[0] [receive] via NET/IB/0/GDRDMA
llama-3-1-nemotron-ultra-253b-instruct-0:40855:40855 [0] NCCL INFO Channel 00/0 : 0[0] -> 1[0] [send] via NET/IB/0/GDRDMA
llama-3-1-nemotron-ultra-253b-instruct-0:40855:40855 [0] NCCL INFO Channel 01/0 : 0[0] -> 1[0] [send] via NET/IB/0/GDRDMA
llama-3-1-nemotron-ultra-253b-instruct-0:40855:40855 [0] NCCL INFO Connected all rings
llama-3-1-nemotron-ultra-253b-instruct-0:40855:40855 [0] NCCL INFO Connected all trees
llama-3-1-nemotron-ultra-253b-instruct-0:40855:40855 [0] NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 512 | 512
llama-3-1-nemotron-ultra-253b-instruct-0:40855:40855 [0] NCCL INFO 2 coll channels, 2 collnet channels, 0 nvls channels, 2 p2p channels, 2 p2p channels per peer
llama-3-1-nemotron-ultra-253b-instruct-0:40855:40855 [0] NCCL INFO ncclCommInitRank comm 0x20fe4520 rank 0 nranks 2 cudaDev 0 nvmlDev 0 busId 100000 commId 0x3a4c4134ba9c7a6d - Init COMPLETE
[rank0]: Traceback (most recent call last):
[rank0]:   File "<frozen runpy>", line 198, in _run_module_as_main
[rank0]:   File "<frozen runpy>", line 88, in _run_code
[rank0]:   File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 1121, in <module>
[rank0]:     uvloop.run(run_server(args))
[rank0]:   File "/usr/local/lib/python3.12/dist-packages/uvloop/__init__.py", line 109, in run
[rank0]:     return __asyncio.run(
[rank0]:            ^^^^^^^^^^^^^^
[rank0]:   File "/usr/lib/python3.12/asyncio/runners.py", line 195, in run
[rank0]:     return runner.run(main)
[rank0]:            ^^^^^^^^^^^^^^^^
[rank0]:   File "/usr/lib/python3.12/asyncio/runners.py", line 118, in run
[rank0]:     return self._loop.run_until_complete(task)
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "uvloop/loop.pyx", line 1518, in uvloop.loop.Loop.run_until_complete
[rank0]:   File "/usr/local/lib/python3.12/dist-packages/uvloop/__init__.py", line 61, in wrapper
[rank0]:     return await main
[rank0]:            ^^^^^^^^^^
[rank0]:   File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 1069, in run_server
[rank0]:     async with build_async_engine_client(args) as engine_client:
[rank0]:                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/usr/lib/python3.12/contextlib.py", line 210, in __aenter__
[rank0]:     return await anext(self.gen)
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 146, in build_async_engine_client
[rank0]:     async with build_async_engine_client_from_engine_args(
[rank0]:                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/usr/lib/python3.12/contextlib.py", line 210, in __aenter__
[rank0]:     return await anext(self.gen)
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 194, in build_async_engine_client_from_engine_args
[rank0]:     engine_client = AsyncLLMEngine.from_vllm_config(
[rank0]:                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/usr/local/lib/python3.12/dist-packages/vllm/engine/async_llm_engine.py", line 653, in from_vllm_config
[rank0]:     return cls(
[rank0]:            ^^^^
[rank0]:   File "/usr/local/lib/python3.12/dist-packages/vllm/engine/async_llm_engine.py", line 608, in __init__
[rank0]:     self.engine = self._engine_class(*args, **kwargs)
[rank0]:                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/usr/local/lib/python3.12/dist-packages/vllm/engine/async_llm_engine.py", line 267, in __init__
[rank0]:     super().__init__(*args, **kwargs)
[rank0]:   File "/usr/local/lib/python3.12/dist-packages/vllm/engine/llm_engine.py", line 284, in __init__
[rank0]:     self._initialize_kv_caches()
[rank0]:   File "/usr/local/lib/python3.12/dist-packages/vllm/engine/llm_engine.py", line 433, in _initialize_kv_caches
[rank0]:     self.model_executor.determine_num_available_blocks())
[rank0]:     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/usr/local/lib/python3.12/dist-packages/vllm/executor/executor_base.py", line 103, in determine_num_available_blocks
[rank0]:     results = self.collective_rpc("determine_num_available_blocks")
[rank0]:               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/usr/local/lib/python3.12/dist-packages/vllm/executor/executor_base.py", line 331, in collective_rpc
[rank0]:     return self._run_workers(method, *args, **(kwargs or {}))
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/usr/local/lib/python3.12/dist-packages/vllm/executor/ray_distributed_executor.py", line 521, in _run_workers
[rank0]:     ray_worker_outputs = ray.get(ray_worker_outputs)
[rank0]:                          ^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/usr/local/lib/python3.12/dist-packages/ray/_private/auto_init_hook.py", line 21, in auto_init_wrapper
[rank0]:     return fn(*args, **kwargs)
[rank0]:            ^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/usr/local/lib/python3.12/dist-packages/ray/_private/client_mode_hook.py", line 103, in wrapper
[rank0]:     return func(*args, **kwargs)
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/usr/local/lib/python3.12/dist-packages/ray/_private/worker.py", line 2771, in get
[rank0]:     values, debugger_breakpoint = worker.get_objects(object_refs, timeout=timeout)
[rank0]:                                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/usr/local/lib/python3.12/dist-packages/ray/_private/worker.py", line 919, in get_objects
[rank0]:     raise value.as_instanceof_cause()
[rank0]: ray.exceptions.RayTaskError(RuntimeError): ray::RayWorkerWrapper.execute_method() (pid=38526, ip=10.244.1.95, actor_id=1f11c4c44f87a44d7516c79816000000, repr=<vllm.executor.ray_utils.RayWorkerWrapper object at 0x7f6c09a57920>)
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/usr/local/lib/python3.12/dist-packages/vllm/worker/worker_base.py", line 621, in execute_method
[rank0]:     raise e
[rank0]:   File "/usr/local/lib/python3.12/dist-packages/vllm/worker/worker_base.py", line 612, in execute_method
[rank0]:     return run_method(self, method, args, kwargs)
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/usr/local/lib/python3.12/dist-packages/vllm/utils.py", line 2347, in run_method
[rank0]:     return func(*args, **kwargs)
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/usr/local/lib/python3.12/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
[rank0]:     return func(*args, **kwargs)
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/usr/local/lib/python3.12/dist-packages/vllm/worker/worker.py", line 229, in determine_num_available_blocks
[rank0]:     self.model_runner.profile_run()
[rank0]:   File "/usr/local/lib/python3.12/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
[rank0]:     return func(*args, **kwargs)
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/usr/local/lib/python3.12/dist-packages/vllm/worker/model_runner.py", line 1243, in profile_run
[rank0]:     self._dummy_run(max_num_batched_tokens, max_num_seqs)
[rank0]:   File "/usr/local/lib/python3.12/dist-packages/vllm/worker/model_runner.py", line 1369, in _dummy_run
[rank0]:     self.execute_model(model_input, kv_caches, intermediate_tensors)
[rank0]:   File "/usr/local/lib/python3.12/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
[rank0]:     return func(*args, **kwargs)
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/usr/local/lib/python3.12/dist-packages/vllm/worker/model_runner.py", line 1816, in execute_model
[rank0]:     logits = self.model.compute_logits(hidden_or_intermediate_states,
[rank0]:              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/nemotron_nas.py", line 438, in compute_logits
[rank0]:     logits = self.logits_processor(self.lm_head, hidden_states,
[rank0]:              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
[rank0]:     return self._call_impl(*args, **kwargs)
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1750, in _call_impl
[rank0]:     return forward_call(*args, **kwargs)
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/logits_processor.py", line 70, in forward
[rank0]:     logits = self._get_logits(hidden_states, lm_head, embedding_bias)
[rank0]:              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/logits_processor.py", line 113, in _get_logits
[rank0]:     logits = self._gather_logits(logits)
[rank0]:              ^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/logits_processor.py", line 98, in _gather_logits
[rank0]:     logits = tensor_model_parallel_gather(logits)
[rank0]:              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/usr/local/lib/python3.12/dist-packages/vllm/distributed/communication_op.py", line 26, in tensor_model_parallel_gather
[rank0]:     return get_tp_group().gather(input_, dst, dim)
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/usr/local/lib/python3.12/dist-packages/vllm/distributed/parallel_state.py", line 341, in gather
[rank0]:     return self.device_communicator.gather(input_, dst, dim)
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/usr/local/lib/python3.12/dist-packages/vllm/distributed/device_communicators/base_device_communicator.py", line 86, in gather
[rank0]:     torch.distributed.gather(input_,
[rank0]:   File "/usr/local/lib/python3.12/dist-packages/torch/distributed/c10d_logger.py", line 81, in wrapper
[rank0]:     return func(*args, **kwargs)
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/usr/local/lib/python3.12/dist-packages/torch/distributed/distributed_c10d.py", line 4006, in gather
[rank0]:     work = group.gather(output_tensors, input_tensors, opts)
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: RuntimeError: NCCL Error 1: unhandled cuda error (run with NCCL_DEBUG=INFO for details)

(RayWorkerWrapper pid=38526, ip=10.244.1.95) ERROR 04-17 15:12:32 [worker_base.py:620]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(RayWorkerWrapper pid=38526, ip=10.244.1.95) ERROR 04-17 15:12:32 [worker_base.py:620]   File "/usr/local/lib/python3.12/dist-packages/vllm/distributed/parallel_state.py", line 341, in gather
(RayWorkerWrapper pid=38526, ip=10.244.1.95) ERROR 04-17 15:12:32 [worker_base.py:620]     return self.device_communicator.gather(input_, dst, dim)
(RayWorkerWrapper pid=38526, ip=10.244.1.95) ERROR 04-17 15:12:32 [worker_base.py:620]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(RayWorkerWrapper pid=38526, ip=10.244.1.95) ERROR 04-17 15:12:32 [worker_base.py:620]   File "/usr/local/lib/python3.12/dist-packages/vllm/distributed/device_communicators/base_device_communicator.py", line 86, in gather
(RayWorkerWrapper pid=38526, ip=10.244.1.95) ERROR 04-17 15:12:32 [worker_base.py:620]     torch.distributed.gather(input_,
(RayWorkerWrapper pid=38526, ip=10.244.1.95) ERROR 04-17 15:12:32 [worker_base.py:620]   File "/usr/local/lib/python3.12/dist-packages/torch/distributed/c10d_logger.py", line 81, in wrapper
(RayWorkerWrapper pid=38526, ip=10.244.1.95) ERROR 04-17 15:12:32 [worker_base.py:620]     return func(*args, **kwargs)
(RayWorkerWrapper pid=38526, ip=10.244.1.95) ERROR 04-17 15:12:32 [worker_base.py:620]            ^^^^^^^^^^^^^^^^^^^^^
(RayWorkerWrapper pid=38526, ip=10.244.1.95) ERROR 04-17 15:12:32 [worker_base.py:620]   File "/usr/local/lib/python3.12/dist-packages/torch/distributed/distributed_c10d.py", line 4006, in gather
(RayWorkerWrapper pid=38526, ip=10.244.1.95) ERROR 04-17 15:12:32 [worker_base.py:620]     work = group.gather(output_tensors, input_tensors, opts)
(RayWorkerWrapper pid=38526, ip=10.244.1.95) ERROR 04-17 15:12:32 [worker_base.py:620]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(RayWorkerWrapper pid=38526, ip=10.244.1.95) ERROR 04-17 15:12:32 [worker_base.py:620] RuntimeError: NCCL Error 1: unhandled cuda error (run with NCCL_DEBUG=INFO for details)
(RayWorkerWrapper pid=38528, ip=10.244.1.95) nnel 06/0 : 2[2] -> 1[1] via P2P/IPC/read
(RayWorkerWrapper pid=38528, ip=10.244.1.95)
(RayWorkerWrapper pid=38528, ip=10.244.1.95) llama-3-1-nemotron-ultra-253b-instruct-0-1:38528:39575 [2] include/alloc.h:179 NCCL WARN Cuda failure 'out of memory'
(RayWorkerWrapper pid=38528, ip=10.244.1.95)
(RayWorkerWrapper pid=38528, ip=10.244.1.95) llama-3-1-nemotron-ultra-253b-instruct-0-1:38528:39575 [2] include/alloc.h:186 NCCL WARN Failed to CUDA calloc 6291456 bytes
(RayWorkerWrapper pid=38528, ip=10.244.1.95) llama-3-1-nemotron-ultra-253b-instruct-0-1:38528:39575 [2] NCCL INFO transport/p2p.cc:218 -> 1
(RayWorkerWrapper pid=38528, ip=10.244.1.95) llama-3-1-nemotron-ultra-253b-instruct-0-1:38528:39575 [2] NCCL INFO transport/p2p.cc:598 -> 1
(RayWorkerWrapper pid=38528, ip=10.244.1.95) llama-3-1-nemotron-ultra-253b-instruct-0-1:38528:39603 [2] NCCL INFO transport/p2p.cc:396 -> 1
(RayWorkerWrapper pid=38528, ip=10.244.1.95) llama-3-1-nemotron-ultra-253b-instruct-0-1:38528:39603 [2] NCCL INFO transport.cc:33 -> 1
(RayWorkerWrapper pid=38528, ip=10.244.1.95) llama-3-1-nemotron-ultra-253b-instruct-0-1:38528:39603 [2] NCCL INFO transport.cc:122 -> 1
(RayWorkerWrapper pid=38528, ip=10.244.1.95) llama-3-1-nemotron-ultra-253b-instruct-0-1:38528:39603 [2] NCCL INFO group.cc:110 -> 1
(RayWorkerWrapper pid=38528, ip=10.244.1.95) llama-3-1-nemotron-ultra-253b-instruct-0-1:38528:39603 [2] NCCL INFO group.cc:64 -> 1 [Async thread]
(RayWorkerWrapper pid=38528, ip=10.244.1.95) llama-3-1-nemotron-ultra-253b-instruct-0-1:38528:38528 [2] NCCL INFO group.cc:418 -ERROR 04-17 15:12:32 [worker_base.py:620] Error executing method 'determine_num_available_blocks'. This might cause deadlock in distributed execution.
(RayWorkerWrapper pid=38527, ip=10.244.1.95)
(RayWorkerWrapper pid=38527, ip=10.244.1.95)
(RayWorkerWrapper pid=38527, ip=10.244.1.95) llama-3-1-nemotron-ultra-253b-instruct-0-1:38527:38527 [1] NCCERROR 04-17 15:12:32 [worker_base.py:620] Error executing method 'determine_num_available_blocks'. This might cause deadlock in distributed execution.
(RayWorkerWrapper pid=38529, ip=10.244.1.95) el 06/0 : 3[3] -> 2[2] via P2P/IPC/read
(RayWorkerWrapper pid=38529, ip=10.244.1.95)
(RayWorkerWrapper pid=38529, ip=10.244.1.95)
(RayWorkerWrapper pid=38529, ip=10.244.1.95) llama-3-1-nemotron-ultra-253b-instruct-0-1:38529:38529 [3] NCCL INFO group.cc:418 -> ERROR 04-17 15:12:32 [worker_base.py:620] Error executing method 'determine_num_available_blocks'. This might cause deadlock in distributed execution.
(RayWorkerWrapper pid=38531, ip=10.244.1.95)
(RayWorkerWrapper pid=38531, ip=10.244.1.95)
(RayWorkerWrapper pid=38531, ip=10.244.1.95) llama-3-1-nemotron-ultrERROR 04-17 15:12:32 [worker_base.py:620] Error executing method 'determine_num_available_blocks'. This might cause deadlock in distributed execution.
(RayWorkerWrapper pid=38532, ip=10.244.1.95)
(RayWorkerWrapper pid=38532, ip=10.244.1.95)
(RayWorkerWrapper pid=38533, ip=10.244.1.95) a P2P/IPC/read
(RayWorkerWrapper pid=38533, ip=10.244.1.95) llama-3-1-nemotron-ultra-253b-instruct-0-1:38533:39597 [7] NCCL INFO Channel 12/1 : 7[7] -> 0[0]
(RayWorkerWrapper pid=38530, ip=10.244.1.95)
(RayWorkerWrapper pid=38530, ip=10.244.1.95)
(RayWorkerWrapper pid=38530, ip=10.244.1.95) llama-3-1-nemotron-ultra-253b-instruct-0-1:38530:38530 [4] NCCL INFO NCCL_NET_GDR_LEVEL set by environment to SYS [repeated 7x across cluster]
(RayWorkerWrapper pid=38530, ip=10.244.1.95) llama-3-1-nemotron-ultra-253b-instruct-0-1:38530:38530 [4] NCCL INFO Channel 01/0 : 0[4] -> 1[4] [receive] via NET/IB/4/GDRDMA [repeated 14x across cluster]
(RayWorkerWrapper pid=38530, ip=10.244.1.95) llama-3-1-nemotron-ultra-253b-instruct-0-1:38530:38530 [4] NCCL INFO Channel 01/0 : 1[4] -> 0[4] [send] via NET/IB/4/GDRDMA [repeated 14x across cluster]
(RayWorkerWrapper pid=38530, ip=10.244.1.95) llama-3-1-nemotron-ultra-253b-instruct-0-1:38530:38530 [4] NCCL INFO Comm config Blocking set to 1 [repeated 7x across cluster]
(RayWorkerWrapper pid=38530, ip=10.244.1.95) llama-3-1-nemotron-ultra-253b-instruct-0-1:38530:39566 [4] NCCL INFO Using non-device net plugin version 0 [repeated 14x across cluster]
(RayWorkerWrapper pid=38530, ip=10.244.1.95) llama-3-1-nemotron-ultra-253b-instruct-0-1:38530:39566 [4] NCCL INFO Using network IB [repeated 14x across cluster]
(RayWorkerWrapper pid=38530, ip=10.244.1.95) llama-3-1-nemotron-ultra-253b-instruct-0-1:38530:39566 [4] NCCL INFO DMA-BUF is available on GPU device 4 [repeated 14x across cluster]
(RayWorkerWrapper pid=38530, ip=10.244.1.95) llama-3-1-nemotron-ultra-253b-instruct-0-1:38530:39566 [4] NCCL INFO ncclCommInitRank comm 0x2ddca3d0 rank 4 nranks 8 cudaDev 4 nvmlDev 4 busId b00000 commId 0x10b891a72369e0c9 - Init START [repeated 14x across cluster]
(RayWorkerWrapper pid=38530, ip=10.244.1.95) llama-3-1-nemotron-ultra-253b-instruct-0-1:38530:39566 [4] NCCL INFO Setting affinity for GPU 4 to ffffff00,00000000,00000000 [repeated 10x across cluster]
(RayWorkerWrapper pid=38530, ip=10.244.1.95) llama-3-1-nemotron-ultra-253b-instruct-0-1:38530:39566 [4] NCCL INFO comm 0x2ddca3d0 rank 4 nRanks 8 nNodes 1 localRanks 8 localRank 4 MNNVL 0 [repeated 14x across cluster]
(RayWorkerWrapper pid=38530, ip=10.244.1.95) 38530:39566 [4] NCCL INFO Trees [0] 5/-1/-1->4->3 [1] 5/-1/-1->4->3 [2] 5/-1/-1->4->3 [3] 5/-1/-1->4->3 [4] 5/-1/-1->4->3 [5] 5/-1/-1->4->3 [6] 5/-1/-1->4->3 [7] 5/-1/-1->4->3 [8] 5/-1/-1->4->3 [9] 5/-1/-1->4->3 [10] 5/-1/-1->4->3 [11] 5/-1/-1->4->3 [12] 5/-1/-1->4->3 [13] 5/-1/-1->4->3 [14] 5/-1/-1->4->3 [15] 5/-1/-1->4->3 [16] 5/-1/-1->4->3 [17] 5/-1/-1->4->3 [18] 5/-1/-1->4->3 [19] 5/-1/-1->4->3 [20] 5/-1/-1->4->3 [21] 5/-1/-1->4->3 [22] 5/-1/-1->4->3 [23] 5/-1/-1->4->3 [repeated 13x across cluster]
(RayWorkerWrapper pid=38530, ip=10.244.1.95) llama-3-1-nemotron-ultra-253b-instruct-0-1:38530:39566 [4] NCCL INFO P2P Chunksize set to 524288 [repeated 14x across cluster]
(RayWorkerWrapper pid=38530, ip=10.244.1.95) llama-3-1-nemotron-ultra-253b-instruct-0-1:38530:39602 [4] NCCL INFO Channel 04/1 : 4[4] -> 0[0] via P2P/IPC/read [repeated 375x across cluster]
(RayWorkerWrapper pid=38530, ip=10.244.1.95) llama-3-1-nemotron-ultra-253b-instruct-0-1:38530:39566 [4] NCCL INFO ncclCommInitRank comm 0x2ddca3d0 rank 4 nranks 8 cudaDev 4 nvmlDev 4 busId b00000 commId 0x10b891a72369e0c9 - Init COMPLETE [repeated 21x across cluster]
(RayWorkerWrapper pid=38529, ip=10.244.1.95) NFO Connected all trees
(RayWorkerWrapper pid=38530, ip=10.244.1.95) llama-3-1-nemotron-ultra-253b-instruct-0-1:38530:39566 [4] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 512 | 512 [repeated 19x across cluster]
(RayWorkerWrapper pid=38530, ip=10.244.1.95) llama-3-1-nemotron-ultra-253b-instruct-0-1:38530:39566 [4] NCCL INFO 24 coll channels, 24 collnet channels, 0 nvls channels, 32 p2p channels, 32 p2p channels per peer [repeated 19x across cluster]
(RayWorkerWrapper pid=38530, ip=10.244.1.95) llama-3-1-nemotron-ultra-253b-instruct-0-1:38530:38530 [4] NCCL INFO TUNER/Plugin: Plugin load returned 2 : libnccl-net.so: cannot open shared object file: No such file or directory : when loading libnccl-tuner.so [repeated 6x across cluster]
(RayWorkerWrapper pid=38530, ip=10.244.1.95) llama-3-1-nemotron-ultra-253b-instruct-0-1:38530:38530 [4] NCCL INFO TUNER/Plugin: Using internal tuner plugin. [repeated 6x across cluster]
(RayWorkerWrapper pid=38529, ip=10.244.1.95) llama-3-1-nemotron-ultra-253b-instruct-0-1:38529:39569 [3] NCCL INFO Setting affinity for GPU 3 to ffffff [repeated 2x across cluster]
(RayWorkerWrapper pid=38530, ip=10.244.1.95) llama-3-1-nemotron-ultra-253b-instruct-0-1:38530:39566 [4] NCCL INFO Connected all rings [repeated 13x across cluster]
(RayWorkerWrapper pid=38530, ip=10.244.1.95) llama-3-1-nemotron-ultra-253b-instruct-0-1:38530:39566 [4] NCCL INFO Connected all trees [repeated 18x across cluster]
(RayWorkerWrapper pid=38530, ip=10.244.1.95) llama-3-1-nemotron-ultra-253b-instruct-0-1:38530:39566 [4] NCCL INFO NVLS multicast support is not available on dev 4 [repeated 6x across cluster]
(RayWorkerWrapper pid=41201) INFO 04-17 15:12:32 [worker.py:267] Memory profiling takes 2.84 seconds [repeated 6x across cluster]
(RayWorkerWrapper pid=41201) INFO 04-17 15:12:32 [worker.py:267] the current vLLM instance can use total_gpu_memory (39.49GiB) x gpu_memory_utilization (0.95) = 37.52GiB [repeated 6x across cluster]
(RayWorkerWrapper pid=41201) INFO 04-17 15:12:32 [worker.py:267] model weights take 25.39GiB; non_torch_memory takes 1.07GiB; PyTorch activation peak memory takes 1.36GiB; the rest of the memory reserved for KV Cache is 9.70GiB. [repeated 6x across cluster]
(RayWorkerWrapper pid=38530, ip=10.244.1.95) ERROR 04-17 15:12:32 [worker_base.py:620] Traceback (most recent call last): [repeated 6x across cluster]
(RayWorkerWrapper pid=38530, ip=10.244.1.95) ERROR 04-17 15:12:32 [worker_base.py:620]   File "/usr/local/lib/python3.12/dist-packages/vllm/worker/worker_base.py", line 612, in execute_method [repeated 6x across cluster]
(RayWorkerWrapper pid=38530, ip=10.244.1.95) ERROR 04-17 15:12:32 [worker_base.py:620]     return run_method(self, method, args, kwargs) [repeated 6x across cluster]
(RayWorkerWrapper pid=38530, ip=10.244.1.95) ERROR 04-17 15:12:32 [worker_base.py:620]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ [repeated 6x across cluster]
(RayWorkerWrapper pid=38530, ip=10.244.1.95) ERROR 04-17 15:12:32 [worker_base.py:620]   File "/usr/local/lib/python3.12/dist-packages/vllm/utils.py", line 2347, in run_method [repeated 6x across cluster]
(RayWorkerWrapper pid=38530, ip=10.244.1.95) ERROR 04-17 15:12:32 [worker_base.py:620]     return func(*args, **kwargs) [repeated 30x across cluster]
(RayWorkerWrapper pid=38530, ip=10.244.1.95) ERROR 04-17 15:12:32 [worker_base.py:620]            ^^^^^^^^^^^^^^^^^^^^^ [repeated 30x across cluster]
(RayWorkerWrapper pid=38530, ip=10.244.1.95) ERROR 04-17 15:12:32 [worker_base.py:620]   File "/usr/local/lib/python3.12/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context [repeated 18x across cluster]
(RayWorkerWrapper pid=38530, ip=10.244.1.95) ERROR 04-17 15:12:32 [worker_base.py:620]   File "/usr/local/lib/python3.12/dist-packages/vllm/worker/worker.py", line 229, in determine_num_available_blocks [repeated 6x across cluster]
(RayWorkerWrapper pid=38530, ip=10.244.1.95) ERROR 04-17 15:12:32 [worker_base.py:620]     self.model_runner.profile_run() [repeated 6x across cluster]
(RayWorkerWrapper pid=38530, ip=10.244.1.95) ERROR 04-17 15:12:32 [worker_base.py:620]   File "/usr/local/lib/python3.12/dist-packages/vllm/worker/model_runner.py", line 1243, in profile_run [repeated 6x across cluster]
(RayWorkerWrapper pid=38530, ip=10.244.1.95) ERROR 04-17 15:12:32 [worker_base.py:620]     self._dummy_run(max_num_batched_tokens, max_num_seqs) [repeated 6x across cluster]
(RayWorkerWrapper pid=38530, ip=10.244.1.95) ERROR 04-17 15:12:32 [worker_base.py:620]   File "/usr/local/lib/python3.12/dist-packages/vllm/worker/model_runner.py", line 1369, in _dummy_run [repeated 6x across cluster]
(RayWorkerWrapper pid=38530, ip=10.244.1.95) ERROR 04-17 15:12:32 [worker_base.py:620]     self.execute_model(model_input, kv_caches, intermediate_tensors) [repeated 6x across cluster]
(RayWorkerWrapper pid=38530, ip=10.244.1.95) ERROR 04-17 15:12:32 [worker_base.py:620]   File "/usr/local/lib/python3.12/dist-packages/vllm/worker/model_runner.py", line 1816, in execute_model [repeated 6x across cluster]
(RayWorkerWrapper pid=38530, ip=10.244.1.95) ERROR 04-17 15:12:32 [worker_base.py:620]     logits = self.model.compute_logits(hidden_or_intermediate_states, [repeated 6x across cluster]
(RayWorkerWrapper pid=38530, ip=10.244.1.95) ERROR 04-17 15:12:32 [worker_base.py:620]              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ [repeated 12x across cluster]
(RayWorkerWrapper pid=38530, ip=10.244.1.95) ERROR 04-17 15:12:32 [worker_base.py:620]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/nemotron_nas.py", line 438, in compute_logits [repeated 6x across cluster]
(RayWorkerWrapper pid=38530, ip=10.244.1.95) ERROR 04-17 15:12:32 [worker_base.py:620]     logits = self.logits_processor(self.lm_head, hidden_states, [repeated 6x across cluster]
(RayWorkerWrapper pid=38530, ip=10.244.1.95) ERROR 04-17 15:12:32 [worker_base.py:620]              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ [repeated 6x across cluster]
(RayWorkerWrapper pid=38530, ip=10.244.1.95) ERROR 04-17 15:12:32 [worker_base.py:620]   File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl [repeated 6x across cluster]
(RayWorkerWrapper pid=38530, ip=10.244.1.95) ERROR 04-17 15:12:32 [worker_base.py:620]     return self._call_impl(*args, **kwargs) [repeated 6x across cluster]
(RayWorkerWrapper pid=38530, ip=10.244.1.95) ERROR 04-17 15:12:32 [worker_base.py:620]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ [repeated 6x across cluster]
(RayWorkerWrapper pid=38530, ip=10.244.1.95) ERROR 04-17 15:12:32 [worker_base.py:620]   File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1750, in _call_impl [repeated 6x across cluster]
(RayWorkerWrapper pid=38530, ip=10.244.1.95) ERROR 04-17 15:12:32 [worker_base.py:620]     return forward_call(*args, **kwargs) [repeated 6x across cluster]
(RayWorkerWrapper pid=38530, ip=10.244.1.95) ERROR 04-17 15:12:32 [worker_base.py:620]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ [repeated 6x across cluster]
(RayWorkerWrapper pid=38530, ip=10.244.1.95) ERROR 04-17 15:12:32 [worker_base.py:620]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/logits_processor.py", line 70, in forward [repeated 6x across cluster]
(RayWorkerWrapper pid=38530, ip=10.244.1.95) ERROR 04-17 15:12:32 [worker_base.py:620]     logits = self._get_logits(hidden_states, lm_head, embedding_bias) [repeated 6x across cluster]
(RayWorkerWrapper pid=38530, ip=10.244.1.95) ERROR 04-17 15:12:32 [worker_base.py:620]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/logits_processor.py", line 113, in _get_logits [repeated 6x across cluster]
(RayWorkerWrapper pid=38530, ip=10.244.1.95) ERROR 04-17 15:12:32 [worker_base.py:620]     logits = self._gather_logits(logits) [repeated 6x across cluster]
(RayWorkerWrapper pid=38530, ip=10.244.1.95) ERROR 04-17 15:12:32 [worker_base.py:620]              ^^^^^^^^^^^^^^^^^^^^^^^^^^^ [repeated 6x across cluster]
(RayWorkerWrapper pid=38530, ip=10.244.1.95) ERROR 04-17 15:12:32 [worker_base.py:620]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/logits_processor.py", line 98, in _gather_logits [repeated 6x across cluster]
(RayWorkerWrapper pid=38530, ip=10.244.1.95) ERROR 04-17 15:12:32 [worker_base.py:620]     logits = tensor_model_parallel_gather(logits) [repeated 6x across cluster]
(RayWorkerWrapper pid=38530, ip=10.244.1.95) ERROR 04-17 15:12:32 [worker_base.py:620]              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ [repeated 6x across cluster]
(RayWorkerWrapper pid=38530, ip=10.244.1.95) ERROR 04-17 15:12:32 [worker_base.py:620]   File "/usr/local/lib/python3.12/dist-packages/vllm/distributed/communication_op.py", line 26, in tensor_model_parallel_gather [repeated 6x across cluster]
(RayWorkerWrapper pid=38530, ip=10.244.1.95) ERROR 04-17 15:12:32 [worker_base.py:620]     return get_tp_group().gather(input_, dst, dim) [repeated 6x across cluster]
(RayWorkerWrapper pid=38530, ip=10.244.1.95) ERROR 04-17 15:12:32 [worker_base.py:620]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ [repeated 6x across cluster]
(RayWorkerWrapper pid=38530, ip=10.244.1.95) ERROR 04-17 15:12:32 [worker_base.py:620]   File "/usr/local/lib/python3.12/dist-packages/torch/distributed/distributed_c10d.py", line 4006, in gather [repeated 18x across cluster]
(RayWorkerWrapper pid=38530, ip=10.244.1.95) ERROR 04-17 15:12:32 [worker_base.py:620]     return self.device_communicator.gather(input_, dst, dim) [repeated 6x across cluster]
(RayWorkerWrapper pid=38530, ip=10.244.1.95) ERROR 04-17 15:12:32 [worker_base.py:620]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ [repeated 12x across cluster]
(RayWorkerWrapper pid=38530, ip=10.244.1.95) ERROR 04-17 15:12:32 [worker_base.py:620]     torch.distributed.gather(input_, [repeated 6x across cluster]
(RayWorkerWrapper pid=38530, ip=10.244.1.95) ERROR 04-17 15:12:32 [worker_base.py:620]   File "/usr/local/lib/python3.12/dist-packages/torch/distributed/c10d_logger.py", line 81, in wrapper [repeated 6x across cluster]
(RayWorkerWrapper pid=38530, ip=10.244.1.95) ERROR 04-17 15:12:32 [worker_base.py:620]     work = group.gather(output_tensors, input_tensors, opts) [repeated 6x across cluster]
(RayWorkerWrapper pid=38530, ip=10.244.1.95) ERROR 04-17 15:12:32 [worker_base.py:620] RuntimeError: NCCL Error 1: unhandled cuda error (run with NCCL_DEBUG=INFO for details) [repeated 6x across cluster]
(RayWorkerWrapper pid=38530, ip=10.244.1.95) llama-3-1-nemotron-ultra-253b-instruct-0-1:38530:39583 [4] include/alloc.h:179 NCCL WARN Cuda failure 'out of memory' [repeated 5x across cluster]
(RayWorkerWrapper pid=38530, ip=10.244.1.95) llama-3-1-nemotron-ultra-253b-instruct-0-1:38530:39583 [4] include/alloc.h:186 NCCL WARN Failed to CUDA calloc 6291456 bytes [repeated 5x across cluster]
(RayWorkerWrapper pid=38530, ip=10.244.1.95) llama-3-1-nemotron-ultra-253b-instruct-0-1:38530:39602 [4] NCCL INFO group.cc:110 -> 1 [repeated 30x across cluster]
(RayWorkerWrapper pid=38530, ip=10.244.1.95) llama-3-1-nemotron-ultra-253b-instruct-0-1:38530:39602 [4] NCCL INFO group.cc:64 -> 1 [Async thread] [repeated 5x across cluster]
(RayWorkerWrapper pid=38530, ip=10.244.1.95) llama-3-1-nemotron-ultrERROR 04-17 15:12:32 [worker_base.py:620] Error executing method 'determine_num_available_blocks'. This might cause deadlock in distributed execution. [repeated 2x across cluster]
INFO 04-17 15:12:33 [ray_distributed_executor.py:127] Shutting down Ray distributed executor. If you see error log from logging.cc regarding SIGTERM received, please ignore because this is the expected termination process in Ray.
INFO 04-17 15:12:33 [ray_distributed_executor.py:127] Shutting down Ray distributed executor. If you see error log from logging.cc regarding SIGTERM received, please ignore because this is the expected termination process in Ray.
[rank0]:[W417 15:12:33.726684447 ProcessGroupNCCL.cpp:1496] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. For more info, please see https://pytorch.org/docs/stable/distributed.html#shutdown (function operator())
/usr/lib/python3.12/multiprocessing/resource_tracker.py:255: UserWarning: resource_tracker: There appear to be 1 leaked shared_memory objects to clean up at shutdown
  warnings.warn('resource_tracker: There appear to be %d '
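
For context when reading the trace: the workers on the second node die inside NCCL with `Cuda failure 'out of memory'` while trying to `CUDA calloc 6291456 bytes` during the tensor-parallel gather of the profiling run, and the memory report above shows each GPU already committing 39.49GiB x 0.95 = 37.52GiB to vLLM, i.e. only about 2GiB per GPU is left for everything else. Below is a minimal sketch of one possible retry, assuming the root cause is simply that NCCL's CUDA buffers no longer fit in that leftover headroom; the 0.90 value is a guess, not a verified fix.

$ # assumption: dropping gpu-memory-utilization from 0.95 to 0.90 leaves
$ # roughly 4GiB per GPU outside the vLLM budget for NCCL's CUDA buffers
$ python3 -m vllm.entrypoints.openai.api_server \
    --model nvidia/Llama-3_1-Nemotron-Ultra-253B-v1 \
    --tensor-parallel-size 8 \
    --pipeline-parallel-size 2 \
    --trust-remote-code \
    --max-model-len 1024 \
    --gpu-memory-utilization 0.90 \
    --enforce-eager

If the smaller budget then brings back the original `No available memory for the cache blocks` error, the two failures are most likely competing for the same headroom, and the remaining knobs visible in the args dump (`--max-num-seqs`, `--max-num-batched-tokens`, `--swap-space`) would be the next things to experiment with.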