Job disconnected after 1000 steps

#1
by alielfilali01 - opened

Hi @osanseviero, @nateraw and @azzr
Today I launched a CPT script using llama-factory (based on HF's TRL) on 4 L4 GPUs, and after about 9 hours (542 minutes) and exactly 1000 steps (logged to wandb), the Jupyter Lab session just got disconnected! I don't know whether this is because of the Jupyter Lab Space, or whether the Space's GPUs have a 9-hour allocation limit to prevent high usage or something similar.
Btw, this is not the Space that got disconnected; I just created it in order to open this discussion.
I'm cc'ing @julien-c as well, since he might have insights on GPU allocation within Spaces.
Thank you in advance, I really appreciate your help.
