Exceeding GPU memory even though I have 2 GPUs
When I run the following command:
CUDA_VISIBLE_DEVICES=0,1 python -m vllm.entrypoints.api_server --model /models/llama --tensor-parallel-size 2
I get the following error:
[rank0]: torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 1.25 GiB. GPU 0 has a total capacity of 15.60 GiB of which 700.88 MiB is free. Process 2514 has 14.91 GiB memory in use. Of the allocated memory 14.62 GiB is allocated by PyTorch, and 87.96 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
INFO 04-02 14:50:40 [multiproc_worker_utils.py:124] Killing local vLLM worker processes
[rank0]:[W402 14:50:41.497221295 ProcessGroupNCCL.cpp:1496] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. For more info, please see https://pytorch.org/docs/stable/distributed.html#shutdown (function operator())
/usr/lib/python3.10/multiprocessing/resource_tracker.py:224: UserWarning: resource_tracker: There appear to be 1 leaked shared_memory objects to clean up at shutdown
warnings.warn('resource_tracker: There appear to be %d '
Following the suggested workaround to avoid fragmentation, I exported PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True and then reran the command, but there was no improvement.
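For reference, this is exactly what I ran (assuming the environment variable just needs to be exported in the same shell before launching the server):

export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True
CUDA_VISIBLE_DEVICES=0,1 python -m vllm.entrypoints.api_server --model /models/llama --tensor-parallel-size 2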
I'd appreciate your support.
I am limited to 2 GPUs in my setup, each with 16 GB of VRAM; I am using RunPod for the simulation.