ZeroGPU and PyTorch mismatch

#4
by Chaerin5 - opened

Hi @hysts
Thank you for helping us with the build issue yesterday!
I'm trying this Space now, but whenever I run something on the GPU, the error below occurs.
The error says the H200 is the problem, but I thought ZeroGPU used A100s.
I tried pinning the PyTorch version in requirements.txt, but that causes a configuration error.
Do you have any idea how to solve the issue below?

Many thanks,
Chaerin

NVIDIA H200 MIG 3g.71gb with CUDA capability sm_90 is not compatible with the current PyTorch installation.
The current PyTorch install supports CUDA capabilities sm_37 sm_50 sm_60 sm_70 sm_75 sm_80 sm_86.
If you want to use the NVIDIA H200 MIG 3g.71gb GPU with PyTorch, please check the instructions at https://pytorch.org/get-started/locally/

  warnings.warn(incompatible_device_warn.format(device_name, capability, " ".join(arch_list), device_name))
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/site-packages/spaces/zero/wrappers.py", line 256, in thread_wrapper
    res = future.result()
  File "/usr/local/lib/python3.10/concurrent/futures/_base.py", line 451, in result
    return self.__get_result()
  File "/usr/local/lib/python3.10/concurrent/futures/_base.py", line 403, in __get_result
    raise self._exception
  File "/usr/local/lib/python3.10/concurrent/futures/thread.py", line 58, in run
    result = self.fn(*self.args, **self.kwargs)
  File "/home/user/app/app.py", line 615, in sample_diff
    z = torch.randn(
RuntimeError: CUDA error: no kernel image is available for execution on the device
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

It worked fine on my local computer with torch 2.5.0+cu118, torchvision 0.20.0+cu118, and gradio 5.27.0, but the error occurred on Hugging Face Spaces.

@Chaerin5 Ah, sorry for the confusion. We're in the middle of migrating the hardware of ZeroGPU Spaces from half A100 to one-third H200 and half H200, but the documentation hasn’t been updated yet. ("Half" here refers to the MIG configuration.) In the end, one-third H200 will be the default, but for now, half H200 is used.

@hysts Thanks for letting me know. But my problem is that the system is complaining that CUDA capability sm_90 is not compatible with the current PyTorch installation.

I wasn't able to specify the PyTorch version in requirements.txt, because that causes a configuration error in HF Spaces. So I deleted the PyTorch specification from requirements.txt and assumed HF would install the latest version of PyTorch for me. But with this setup, the Space still reports that CUDA capability sm_90 is not compatible with the current PyTorch installation, so I have no idea how to fix this issue.

Could you help me with this?🥹
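
For anyone debugging the same warning, here is a minimal Python sketch of the compatibility check. It assumes, as the warning text suggests, that a PyTorch build can run on a GPU only when it ships kernels for the same major compute capability. The helper names `parse_sm`, `is_capability_supported`, and `report_installed_torch` are hypothetical; only `torch.__version__`, `torch.version.cuda`, `torch.cuda.is_available()`, `torch.cuda.get_arch_list()`, and `torch.cuda.get_device_capability()` are real PyTorch calls. The rule ignores PTX (`compute_XX`) entries, which may allow JIT compilation on newer GPUs, so treat it as an approximation.

```python
# Architectures copied from the warning in the log above.
LOGGED_ARCH_LIST = ["sm_37", "sm_50", "sm_60", "sm_70", "sm_75", "sm_80", "sm_86"]


def parse_sm(arch):
    """Turn an entry such as 'sm_86' or 'sm_90a' into the integer 86 or 90."""
    digits = "".join(ch for ch in arch.split("_")[1] if ch.isdigit())
    return int(digits)


def is_capability_supported(capability, arch_list):
    """Approximate check: is there a binary kernel for the same major version?

    Assumption: only 'sm_*' entries count; 'compute_*' (PTX) entries are ignored.
    """
    major, _minor = capability
    built = [parse_sm(a) for a in arch_list if a.startswith("sm_")]
    return any(sm // 10 == major for sm in built)


def report_installed_torch():
    """Print what the running environment actually has installed (needs torch)."""
    import torch  # imported lazily so the helpers above work without torch

    print("torch:", torch.__version__, "| built for CUDA:", torch.version.cuda)
    if not torch.cuda.is_available():
        print("No CUDA device is visible from this process.")
        return
    arch_list = torch.cuda.get_arch_list()
    capability = torch.cuda.get_device_capability()
    print("compiled architectures:", " ".join(arch_list))
    print("device capability:", capability)
    print("supported:", is_capability_supported(capability, arch_list))


# The H200 reports sm_90, i.e. capability (9, 0); an A100 reports (8, 0).
print(is_capability_supported((9, 0), LOGGED_ARCH_LIST))
print(is_capability_supported((8, 0), LOGGED_ARCH_LIST))
```

Calling `report_installed_torch()` inside the Space shows which PyTorch build was actually installed, which tests the assumption that the latest version is being used. On ZeroGPU the GPU is presumably only attached inside a `@spaces.GPU` function, so the call would likely need to happen there.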
