========================
START TIME: Sat Jul 6 09:36:54 UTC 2024
python3 version = Python 3.10.14
========================
The token has not been saved to the git credentials helper. Pass `add_to_git_credential=True` in this function directly or `--add-to-git-credential` if using via `huggingface-cli` if you want to set the git credential as well.
Token is valid (permission: write).
Your token has been saved to /admin/home/ferdinand_mom/.cache/huggingface/token
Login successful
Already on 'bench_cluster'
M examples/config_tiny_llama.py
M examples/config_tiny_llama.yaml
M examples/train_tiny_llama.sh
Your branch is up to date with 'origin/bench_cluster'.
Job status: RUNNING
[2024-07-06 09:36:57,179] torch.distributed.run: [WARNING]
[2024-07-06 09:36:57,179] torch.distributed.run: [WARNING] *****************************************
[2024-07-06 09:36:57,179] torch.distributed.run: [WARNING] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
[2024-07-06 09:36:57,179] torch.distributed.run: [WARNING] *****************************************
[2024-07-06 09:36:57,188] torch.distributed.run: [WARNING]
[2024-07-06 09:36:57,188] torch.distributed.run: [WARNING] *****************************************
[2024-07-06 09:36:57,188] torch.distributed.run: [WARNING] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
[2024-07-06 09:36:57,188] torch.distributed.run: [WARNING] *****************************************
[2024-07-06 09:36:57,188] torch.distributed.run: [WARNING]
[2024-07-06 09:36:57,188] torch.distributed.run: [WARNING] *****************************************
[2024-07-06 09:36:57,188] torch.distributed.run: [WARNING] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
[2024-07-06 09:36:57,188] torch.distributed.run: [WARNING] *****************************************
[2024-07-06 09:36:57,190] torch.distributed.run: [WARNING]
[2024-07-06 09:36:57,190] torch.distributed.run: [WARNING] *****************************************
[2024-07-06 09:36:57,190] torch.distributed.run: [WARNING] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
[2024-07-06 09:36:57,190] torch.distributed.run: [WARNING] *****************************************
[2024-07-06 09:36:57,196] torch.distributed.run: [WARNING]
[2024-07-06 09:36:57,196] torch.distributed.run: [WARNING] *****************************************
[2024-07-06 09:36:57,196] torch.distributed.run: [WARNING] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
[2024-07-06 09:36:57,196] torch.distributed.run: [WARNING] *****************************************
[2024-07-06 09:36:57,199] torch.distributed.run: [WARNING]
[2024-07-06 09:36:57,199] torch.distributed.run: [WARNING] *****************************************
[2024-07-06 09:36:57,199] torch.distributed.run: [WARNING] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
[2024-07-06 09:36:57,199] torch.distributed.run: [WARNING] *****************************************
[2024-07-06 09:36:57,230] torch.distributed.run: [WARNING]
[2024-07-06 09:36:57,230] torch.distributed.run: [WARNING] *****************************************
[2024-07-06 09:36:57,230] torch.distributed.run: [WARNING] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
[2024-07-06 09:36:57,230] torch.distributed.run: [WARNING] *****************************************
[2024-07-06 09:36:57,259] torch.distributed.run: [WARNING]
[2024-07-06 09:36:57,259] torch.distributed.run: [WARNING] *****************************************
[2024-07-06 09:36:57,259] torch.distributed.run: [WARNING] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
[2024-07-06 09:36:57,259] torch.distributed.run: [WARNING] *****************************************
[default0]:07/06/2024 09:37:15 [WARNING|DP=0|PP=0|TP=0|ip-26-0-161-138]: [Vocab Size Padding] Padded vocab (size: 50257) with 3 dummy tokens (new size: 50260)
[default0]:07/06/2024 09:37:15 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-138]: Config:
[default0]:07/06/2024 09:37:15 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-138]: Config(general=GeneralArgs(project='bench_cluster',
[default0]:07/06/2024 09:37:15 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-138]: run='%date_%jobid',
[default0]:07/06/2024 09:37:15 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-138]: seed=42,
[default0]:07/06/2024 09:37:15 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-138]: step=None,
[default0]:07/06/2024 09:37:15 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-138]: consumed_train_samples=None,
[default0]:07/06/2024 09:37:15 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-138]: benchmark_csv_path=None,
[default0]:07/06/2024 09:37:15 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-138]: ignore_sanity_checks=True),
[default0]:07/06/2024 09:37:15 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-138]: parallelism=ParallelismArgs(dp=1,
[default0]:07/06/2024 09:37:15 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-138]: pp=16,
[default0]:07/06/2024 09:37:15 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-138]: tp=4,
[default0]:07/06/2024 09:37:15 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-138]: pp_engine=,
[default0]:07/06/2024 09:37:15 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-138]: tp_mode=,
[default0]:07/06/2024 09:37:15 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-138]: tp_linear_async_communication=False,
[default0]:07/06/2024 09:37:15 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-138]: expert_parallel_size=1),
[default0]:07/06/2024 09:37:15 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-138]: model=ModelArgs(model_config=LlamaConfig(bos_token_id=1,
[default0]:07/06/2024 09:37:15 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-138]: eos_token_id=2,
[default0]:07/06/2024 09:37:15 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-138]: hidden_act='silu',
[default0]:07/06/2024 09:37:15 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-138]: hidden_size=2048,
[default0]:07/06/2024 09:37:15 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-138]: initializer_range=0.02,
[default0]:07/06/2024 09:37:15 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-138]: intermediate_size=4096,
[default0]:07/06/2024 09:37:15 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-138]: is_llama_config=True,
[default0]:07/06/2024 09:37:15 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-138]: max_position_embeddings=4096,
[default0]:07/06/2024 09:37:15 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-138]: num_attention_heads=32,
[default0]:07/06/2024 09:37:15 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-138]: num_hidden_layers=24,
[default0]:07/06/2024 09:37:15 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-138]: num_key_value_heads=32,
[default0]:07/06/2024 09:37:15 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-138]: pad_token_id=None,
[default0]:07/06/2024 09:37:15 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-138]: pretraining_tp=1,
[default0]:07/06/2024 09:37:15 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-138]: rms_norm_eps=1e-05,
[default0]:07/06/2024 09:37:15 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-138]: rope_scaling=None,
[default0]:07/06/2024 09:37:15 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-138]: rope_theta=10000.0,
[default0]:07/06/2024 09:37:15 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-138]: tie_word_embeddings=True,
[default0]:07/06/2024 09:37:15 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-138]: use_cache=True,
[default0]:07/06/2024 09:37:15 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-138]: vocab_size=50260),
[default0]:07/06/2024 09:37:15 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-138]: init_method=RandomInit(std=0.025),
[default0]:07/06/2024 09:37:15 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-138]: dtype=torch.bfloat16,
[default0]:07/06/2024 09:37:15 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-138]: make_vocab_size_divisible_by=1,
[default0]:07/06/2024 09:37:15 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-138]: ddp_bucket_cap_mb=25),
[default0]:07/06/2024 09:37:15 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-138]: tokenizer=TokenizerArgs(tokenizer_name_or_path='openai-community/gpt2',
[default0]:07/06/2024 09:37:15 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-138]: tokenizer_revision=None,
[default0]:07/06/2024 09:37:15 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-138]: tokenizer_max_length=None),
[default0]:07/06/2024 09:37:15 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-138]: checkpoints=CheckpointsArgs(checkpoints_path=PosixPath('/dev/null'),
[default0]:07/06/2024 09:37:15 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-138]: checkpoint_interval=100000,
[default0]:07/06/2024 09:37:15 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-138]: save_initial_state=False,
[default0]:07/06/2024 09:37:15 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-138]: resume_checkpoint_path=None,
[default0]:07/06/2024 09:37:15 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-138]: checkpoints_path_is_shared_file_system=False),
[default0]:07/06/2024 09:37:15 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-138]: logging=LoggingArgs(log_level='info',
[default0]:07/06/2024 09:37:15 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-138]: log_level_replica='info',
[default0]:07/06/2024 09:37:15 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-138]: iteration_step_info_interval=1),
[default0]:07/06/2024 09:37:15 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-138]: tokens=TokensArgs(sequence_length=4096,
[default0]:07/06/2024 09:37:15 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-138]: train_steps=20,
[default0]:07/06/2024 09:37:15 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-138]: micro_batch_size=16,
[default0]:07/06/2024 09:37:15 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-138]: batch_accumulation_per_replica=64,
[default0]:07/06/2024 09:37:15 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-138]: val_check_interval=-1,
[default0]:07/06/2024 09:37:15 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-138]: limit_val_batches=0,
[default0]:07/06/2024 09:37:15 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-138]: limit_test_batches=0),
[default0]:07/06/2024 09:37:15 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-138]: optimizer=OptimizerArgs(optimizer_factory=AdamWOptimizerArgs(adam_eps=1e-08,
[default0]:07/06/2024 09:37:15 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-138]: adam_beta1=0.9,
[default0]:07/06/2024 09:37:15 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-138]: adam_beta2=0.95,
[default0]:07/06/2024 09:37:15 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-138]: torch_adam_is_fused=True,
[default0]:07/06/2024 09:37:15 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-138]: name='adamW'),
[default0]:07/06/2024 09:37:15 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-138]: zero_stage=1,
[default0]:07/06/2024 09:37:15 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-138]: weight_decay=0.01,
[default0]:07/06/2024 09:37:15 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-138]: clip_grad=1.0,
[default0]:07/06/2024 09:37:15 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-138]: accumulate_grad_in_fp32=True,
[default0]:07/06/2024 09:37:15 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-138]: learning_rate_scheduler=LRSchedulerArgs(learning_rate=0.0001,
[default0]:07/06/2024 09:37:15 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-138]: lr_warmup_steps=1,
[default0]:07/06/2024 09:37:15 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-138]: lr_warmup_style='linear',
[default0]:07/06/2024 09:37:15 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-138]: lr_decay_style='linear',
[default0]:07/06/2024 09:37:15 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-138]: lr_decay_steps=19,
[default0]:07/06/2024 09:37:15 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-138]: lr_decay_starting_step=None,
[default0]:07/06/2024 09:37:15 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-138]: min_decay_lr=1e-05)),
[default0]:07/06/2024 09:37:15 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-138]: data_stages=[DatasetStageArgs(name='Training Stage',
[default0]:07/06/2024 09:37:15 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-138]: start_training_step=1,
[default0]:07/06/2024 09:37:15 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-138]: data=DataArgs(dataset=PretrainDatasetsArgs(hf_dataset_or_datasets='roneneldan/TinyStories',
[default0]:07/06/2024 09:37:15 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-138]: hf_dataset_splits='train',
[default0]:07/06/2024 09:37:15 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-138]: hf_dataset_config_name=None,
[default0]:07/06/2024 09:37:15 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-138]: dataset_processing_num_proc_per_process=64,
[default0]:07/06/2024 09:37:15 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-138]: dataset_overwrite_cache=False,
[default0]:07/06/2024 09:37:15 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-138]: text_column_name='text'),
[default0]:07/06/2024 09:37:15 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-138]: seed=42,
[default0]:07/06/2024 09:37:15 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-138]: num_loading_workers=0))],
[default0]:07/06/2024 09:37:15 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-138]: profiler=ProfilerArgs(profiler_export_path=PosixPath('/fsx/ferdinandmom/ferdinand-hf/bench_cluster/results/llama-1B/64_GPUS/dp-1_tp-4_pp-16_mbz-16')),
[default0]:07/06/2024 09:37:15 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-138]: lighteval=None)
[default0]:07/06/2024 09:37:15 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-138]: Model Config:
[default0]:07/06/2024 09:37:15 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-138]: LlamaConfig(bos_token_id=1,
[default0]:07/06/2024 09:37:15 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-138]: eos_token_id=2,
[default0]:07/06/2024 09:37:15 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-138]: hidden_act='silu',
[default0]:07/06/2024 09:37:15 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-138]: hidden_size=2048,
[default0]:07/06/2024 09:37:15 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-138]: initializer_range=0.02,
[default0]:07/06/2024 09:37:15 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-138]: intermediate_size=4096,
[default0]:07/06/2024 09:37:15 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-138]: is_llama_config=True,
[default0]:07/06/2024 09:37:15 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-138]: max_position_embeddings=4096,
[default0]:07/06/2024 09:37:15 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-138]: num_attention_heads=32,
[default0]:07/06/2024 09:37:15 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-138]: num_hidden_layers=24,
[default0]:07/06/2024 09:37:15 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-138]: num_key_value_heads=32,
[default0]:07/06/2024 09:37:15 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-138]: pad_token_id=None,
[default0]:07/06/2024 09:37:15 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-138]: pretraining_tp=1,
[default0]:07/06/2024 09:37:15 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-138]: rms_norm_eps=1e-05,
[default0]:07/06/2024 09:37:15 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-138]: rope_scaling=None,
[default0]:07/06/2024 09:37:15 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-138]: rope_theta=10000.0,
[default0]:07/06/2024 09:37:15 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-138]: tie_word_embeddings=True,
[default0]:07/06/2024 09:37:15 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-138]: use_cache=True,
[default0]:07/06/2024 09:37:15 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-138]: vocab_size=50260)
[default0]:07/06/2024 09:37:15 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-138]: Building model..
[default0]:07/06/2024 09:37:15 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-138]: Setting PP block ranks...
[default0]:07/06/2024 09:37:31 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-138]: Total number of parameters: 1.21G (2313.42MiB)
[default0]:07/06/2024 09:37:31 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-138]: Local number of parameters: 46.7M (89.10MiB)
[default0]:07/06/2024 09:37:31 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-138]: [After model building] Memory usage: 92.03MiB. Peak allocated: 94.06MiB Peak reserved: 96.00MiB
[default0]:07/06/2024 09:37:31 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-138]: No checkpoint path provided.
[default0]:07/06/2024 09:37:31 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-138]: Parametrizing model parameters using StandardParametrizator
[default1]:07/06/2024 09:37:31 [INFO|DP=0|PP=10|TP=1|ip-26-0-165-59]: Local number of parameters: 21M (40.02MiB)
[default1]:07/06/2024 09:37:31 [INFO|DP=0|PP=10|TP=1|ip-26-0-165-59]: [After model building] Memory usage: 42.03MiB. Peak allocated: 44.06MiB Peak reserved: 46.00MiB
[default1]:07/06/2024 09:37:31 [INFO|DP=0|PP=10|TP=1|ip-26-0-165-59]: No checkpoint path provided.
[default1]:07/06/2024 09:37:31 [INFO|DP=0|PP=0|TP=1|ip-26-0-161-138]: Local number of parameters: 46.7M (89.10MiB)
[default1]:07/06/2024 09:37:31 [INFO|DP=0|PP=0|TP=1|ip-26-0-161-138]: [After model building] Memory usage: 92.03MiB. Peak allocated: 94.06MiB Peak reserved: 96.00MiB
[default2]:07/06/2024 09:37:31 [INFO|DP=0|PP=10|TP=2|ip-26-0-165-59]: Local number of parameters: 21M (40.02MiB)
[default2]:07/06/2024 09:37:31 [INFO|DP=0|PP=10|TP=2|ip-26-0-165-59]: [After model building] Memory usage: 42.03MiB. Peak allocated: 44.06MiB Peak reserved: 46.00MiB
[default2]:07/06/2024 09:37:31 [INFO|DP=0|PP=10|TP=2|ip-26-0-165-59]: No checkpoint path provided.
[default1]:07/06/2024 09:37:31 [INFO|DP=0|PP=0|TP=1|ip-26-0-161-138]: No checkpoint path provided.
[default3]:07/06/2024 09:37:31 [INFO|DP=0|PP=10|TP=3|ip-26-0-165-59]: Local number of parameters: 21M (40.02MiB)
[default3]:07/06/2024 09:37:31 [INFO|DP=0|PP=10|TP=3|ip-26-0-165-59]: [After model building] Memory usage: 42.03MiB. Peak allocated: 44.06MiB Peak reserved: 46.00MiB
[default6]:07/06/2024 09:37:31 [INFO|DP=0|PP=1|TP=2|ip-26-0-161-138]: Local number of parameters: 21M (40.02MiB)
[default6]:07/06/2024 09:37:31 [INFO|DP=0|PP=1|TP=2|ip-26-0-161-138]: [After model building] Memory usage: 42.03MiB. Peak allocated: 44.06MiB Peak reserved: 46.00MiB
[default6]:07/06/2024 09:37:31 [INFO|DP=0|PP=1|TP=2|ip-26-0-161-138]: No checkpoint path provided.
[default1]:07/06/2024 09:37:31 [INFO|DP=0|PP=8|TP=1|ip-26-0-165-131]: Local number of parameters: 10.5M (20.01MiB)
[default1]:07/06/2024 09:37:31 [INFO|DP=0|PP=8|TP=1|ip-26-0-165-131]: [After model building] Memory usage: 21.02MiB. Peak allocated: 23.05MiB Peak reserved: 24.00MiB
[default1]:07/06/2024 09:37:31 [INFO|DP=0|PP=8|TP=1|ip-26-0-165-131]: No checkpoint path provided.
[default3]:07/06/2024 09:37:31 [INFO|DP=0|PP=10|TP=3|ip-26-0-165-59]: No checkpoint path provided.
[default2]:07/06/2024 09:37:31 [INFO|DP=0|PP=0|TP=2|ip-26-0-161-138]: Local number of parameters: 46.7M (89.10MiB)
[default0]:07/06/2024 09:37:31 [INFO|DP=0|PP=12|TP=0|ip-26-0-173-246]: Local number of parameters: 21M (40.02MiB)
[default0]:07/06/2024 09:37:31 [INFO|DP=0|PP=12|TP=0|ip-26-0-173-246]: [After model building] Memory usage: 42.03MiB. Peak allocated: 44.06MiB Peak reserved: 46.00MiB
[default0]:07/06/2024 09:37:31 [INFO|DP=0|PP=12|TP=0|ip-26-0-173-246]: No checkpoint path provided.
[default0]:07/06/2024 09:37:31 [INFO|DP=0|PP=8|TP=0|ip-26-0-165-131]: Local number of parameters: 10.5M (20.01MiB)
[default6]:07/06/2024 09:37:31 [INFO|DP=0|PP=11|TP=2|ip-26-0-165-59]: Local number of parameters: 10.5M (20.01MiB)
[default6]:07/06/2024 09:37:31 [INFO|DP=0|PP=11|TP=2|ip-26-0-165-59]: [After model building] Memory usage: 21.02MiB. Peak allocated: 23.05MiB Peak reserved: 24.00MiB
[default6]:07/06/2024 09:37:31 [INFO|DP=0|PP=11|TP=2|ip-26-0-165-59]: No checkpoint path provided.
[default0]:07/06/2024 09:37:31 [INFO|DP=0|PP=6|TP=0|ip-26-0-163-147]: Local number of parameters: 21M (40.02MiB)
[default0]:07/06/2024 09:37:31 [INFO|DP=0|PP=6|TP=0|ip-26-0-163-147]: [After model building] Memory usage: 42.03MiB. Peak allocated: 44.06MiB Peak reserved: 46.00MiB
[default0]:07/06/2024 09:37:31 [INFO|DP=0|PP=6|TP=0|ip-26-0-163-147]: No checkpoint path provided.
[default2]:07/06/2024 09:37:31 [INFO|DP=0|PP=0|TP=2|ip-26-0-161-138]: [After model building] Memory usage: 92.03MiB. Peak allocated: 94.06MiB Peak reserved: 96.00MiB
[default2]:07/06/2024 09:37:31 [INFO|DP=0|PP=0|TP=2|ip-26-0-161-138]: No checkpoint path provided.
[default5]:07/06/2024 09:37:31 [INFO|DP=0|PP=13|TP=1|ip-26-0-173-246]: Local number of parameters: 21M (40.02MiB)
[default4]:07/06/2024 09:37:31 [INFO|DP=0|PP=13|TP=0|ip-26-0-173-246]: Local number of parameters: 21M (40.02MiB)
[default0]:07/06/2024 09:37:31 [INFO|DP=0|PP=8|TP=0|ip-26-0-165-131]: [After model building] Memory usage: 21.02MiB. Peak allocated: 23.05MiB Peak reserved: 24.00MiB
[default7]:07/06/2024 09:37:31 [INFO|DP=0|PP=11|TP=3|ip-26-0-165-59]: Local number of parameters: 10.5M (20.01MiB)
[default7]:07/06/2024 09:37:31 [INFO|DP=0|PP=11|TP=3|ip-26-0-165-59]: [After model building] Memory usage: 21.02MiB. Peak allocated: 23.05MiB Peak reserved: 24.00MiB
[default7]:07/06/2024 09:37:31 [INFO|DP=0|PP=11|TP=3|ip-26-0-165-59]: No checkpoint path provided.
[default7]:07/06/2024 09:37:31 [INFO|DP=0|PP=7|TP=3|ip-26-0-163-147]: Local number of parameters: 21M (40.02MiB)
[default7]:07/06/2024 09:37:31 [INFO|DP=0|PP=7|TP=3|ip-26-0-163-147]: [After model building] Memory usage: 42.03MiB. Peak allocated: 44.06MiB Peak reserved: 46.00MiB
[default5]:07/06/2024 09:37:31 [INFO|DP=0|PP=13|TP=1|ip-26-0-173-246]: [After model building] Memory usage: 42.03MiB. Peak allocated: 44.06MiB Peak reserved: 46.00MiB
[default5]:07/06/2024 09:37:31 [INFO|DP=0|PP=13|TP=1|ip-26-0-173-246]: No checkpoint path provided.
[default6]:07/06/2024 09:37:31 [INFO|DP=0|PP=9|TP=2|ip-26-0-165-131]: Local number of parameters: 21M (40.02MiB)
[default6]:07/06/2024 09:37:31 [INFO|DP=0|PP=9|TP=2|ip-26-0-165-131]: [After model building] Memory usage: 42.03MiB. Peak allocated: 44.06MiB Peak reserved: 46.00MiB
[default6]:07/06/2024 09:37:31 [INFO|DP=0|PP=9|TP=2|ip-26-0-165-131]: No checkpoint path provided.
[default0]:07/06/2024 09:37:31 [INFO|DP=0|PP=8|TP=0|ip-26-0-165-131]: No checkpoint path provided.
[default7]:07/06/2024 09:37:31 [INFO|DP=0|PP=7|TP=3|ip-26-0-163-147]: No checkpoint path provided.
[default4]:07/06/2024 09:37:31 [INFO|DP=0|PP=13|TP=0|ip-26-0-173-246]: [After model building] Memory usage: 42.03MiB. Peak allocated: 44.06MiB Peak reserved: 46.00MiB
[default7]:07/06/2024 09:37:31 [INFO|DP=0|PP=9|TP=3|ip-26-0-165-131]: Local number of parameters: 21M (40.02MiB)
[default7]:07/06/2024 09:37:31 [INFO|DP=0|PP=9|TP=3|ip-26-0-165-131]: [After model building] Memory usage: 42.03MiB. Peak allocated: 44.06MiB Peak reserved: 46.00MiB
[default7]:07/06/2024 09:37:31 [INFO|DP=0|PP=9|TP=3|ip-26-0-165-131]: No checkpoint path provided.
[default3]:07/06/2024 09:37:31 [INFO|DP=0|PP=6|TP=3|ip-26-0-163-147]: Local number of parameters: 21M (40.02MiB)
[default4]:07/06/2024 09:37:31 [INFO|DP=0|PP=13|TP=0|ip-26-0-173-246]: No checkpoint path provided.
[default3]:07/06/2024 09:37:31 [INFO|DP=0|PP=8|TP=3|ip-26-0-165-131]: Local number of parameters: 10.5M (20.01MiB)
[default2]:07/06/2024 09:37:31 [INFO|DP=0|PP=6|TP=2|ip-26-0-163-147]: Local number of parameters: 21M (40.02MiB)
[default0]:07/06/2024 09:37:31 [INFO|DP=0|PP=14|TP=0|ip-26-0-174-36]: Local number of parameters: 25.7M (49.09MiB)
[default0]:07/06/2024 09:37:31 [INFO|DP=0|PP=14|TP=0|ip-26-0-174-36]: [After model building] Memory usage: 50.01MiB. Peak allocated: 50.03MiB Peak reserved: 52.00MiB
[default0]:07/06/2024 09:37:31 [INFO|DP=0|PP=14|TP=0|ip-26-0-174-36]: No checkpoint path provided.
[default3]:07/06/2024 09:37:31 [INFO|DP=0|PP=0|TP=3|ip-26-0-161-138]: Local number of parameters: 46.7M (89.10MiB)
[default1]:07/06/2024 09:37:31 [INFO|DP=0|PP=12|TP=1|ip-26-0-173-246]: Local number of parameters: 21M (40.02MiB)
[default3]:07/06/2024 09:37:31 [INFO|DP=0|PP=8|TP=3|ip-26-0-165-131]: [After model building] Memory usage: 21.02MiB. Peak allocated: 23.05MiB Peak reserved: 24.00MiB
[default3]:07/06/2024 09:37:31 [INFO|DP=0|PP=8|TP=3|ip-26-0-165-131]: No checkpoint path provided.
[default2]:07/06/2024 09:37:31 [INFO|DP=0|PP=6|TP=2|ip-26-0-163-147]: [After model building] Memory usage: 42.03MiB. Peak allocated: 44.06MiB Peak reserved: 46.00MiB
[default3]:07/06/2024 09:37:31 [INFO|DP=0|PP=6|TP=3|ip-26-0-163-147]: [After model building] Memory usage: 42.03MiB. Peak allocated: 44.06MiB Peak reserved: 46.00MiB
[default7]:07/06/2024 09:37:31 [INFO|DP=0|PP=15|TP=3|ip-26-0-174-36]: Local number of parameters: 0 (0.00MiB)
[default3]:07/06/2024 09:37:31 [INFO|DP=0|PP=0|TP=3|ip-26-0-161-138]: [After model building] Memory usage: 92.03MiB. Peak allocated: 94.06MiB Peak reserved: 96.00MiB
[default1]:07/06/2024 09:37:31 [INFO|DP=0|PP=12|TP=1|ip-26-0-173-246]: [After model building] Memory usage: 42.03MiB. Peak allocated: 44.06MiB Peak reserved: 46.00MiB
[default1]:07/06/2024 09:37:31 [INFO|DP=0|PP=12|TP=1|ip-26-0-173-246]: No checkpoint path provided.
[default4]:07/06/2024 09:37:31 [INFO|DP=0|PP=9|TP=0|ip-26-0-165-131]: Local number of parameters: 21M (40.02MiB)
[default4]:07/06/2024 09:37:31 [INFO|DP=0|PP=9|TP=0|ip-26-0-165-131]: [After model building] Memory usage: 42.03MiB. Peak allocated: 44.06MiB Peak reserved: 46.00MiB
[default4]:07/06/2024 09:37:31 [INFO|DP=0|PP=9|TP=0|ip-26-0-165-131]: No checkpoint path provided.
[default3]:07/06/2024 09:37:31 [INFO|DP=0|PP=6|TP=3|ip-26-0-163-147]: No checkpoint path provided.
[default7]:07/06/2024 09:37:31 [INFO|DP=0|PP=15|TP=3|ip-26-0-174-36]: [After model building] Memory usage: 0.01MiB. Peak allocated: 0.03MiB Peak reserved: 2.00MiB
[default7]:07/06/2024 09:37:31 [INFO|DP=0|PP=15|TP=3|ip-26-0-174-36]: No checkpoint path provided.
[default3]:07/06/2024 09:37:31 [INFO|DP=0|PP=0|TP=3|ip-26-0-161-138]: No checkpoint path provided.
[default0]:07/06/2024 09:37:31 [INFO|DP=0|PP=4|TP=0|ip-26-0-163-134]: Local number of parameters: 21M (40.02MiB)
[default0]:07/06/2024 09:37:31 [INFO|DP=0|PP=4|TP=0|ip-26-0-163-134]: [After model building] Memory usage: 42.03MiB. Peak allocated: 44.06MiB Peak reserved: 46.00MiB
[default3]:07/06/2024 09:37:31 [INFO|DP=0|PP=12|TP=3|ip-26-0-173-246]: Local number of parameters: 21M (40.02MiB)
[default2]:07/06/2024 09:37:31 [INFO|DP=0|PP=8|TP=2|ip-26-0-165-131]: Local number of parameters: 10.5M (20.01MiB)
[default0]:07/06/2024 09:37:31 [INFO|DP=0|PP=10|TP=0|ip-26-0-165-59]: Local number of parameters: 21M (40.02MiB)
[default0]:07/06/2024 09:37:31 [INFO|DP=0|PP=10|TP=0|ip-26-0-165-59]: [After model building] Memory usage: 42.03MiB. Peak allocated: 44.06MiB Peak reserved: 46.00MiB
[default0]:07/06/2024 09:37:31 [INFO|DP=0|PP=10|TP=0|ip-26-0-165-59]: No checkpoint path provided.
[default2]:07/06/2024 09:37:31 [INFO|DP=0|PP=6|TP=2|ip-26-0-163-147]: No checkpoint path provided.
[default4]:07/06/2024 09:37:31 [INFO|DP=0|PP=15|TP=0|ip-26-0-174-36]: Local number of parameters: 0 (0.00MiB)
[default4]:07/06/2024 09:37:31 [INFO|DP=0|PP=1|TP=0|ip-26-0-161-138]: Local number of parameters: 21M (40.02MiB)
[default0]:07/06/2024 09:37:31 [INFO|DP=0|PP=4|TP=0|ip-26-0-163-134]: No checkpoint path provided.
[default5]:07/06/2024 09:37:31 [INFO|DP=0|PP=5|TP=1|ip-26-0-163-134]: Local number of parameters: 10.5M (20.01MiB)
[default2]:07/06/2024 09:37:31 [INFO|DP=0|PP=12|TP=2|ip-26-0-173-246]: Local number of parameters: 21M (40.02MiB)
[default2]:07/06/2024 09:37:31 [INFO|DP=0|PP=8|TP=2|ip-26-0-165-131]: [After model building] Memory usage: 21.02MiB. Peak allocated: 23.05MiB Peak reserved: 24.00MiB
[default2]:07/06/2024 09:37:31 [INFO|DP=0|PP=8|TP=2|ip-26-0-165-131]: No checkpoint path provided.
[default5]:07/06/2024 09:37:31 [INFO|DP=0|PP=11|TP=1|ip-26-0-165-59]: Local number of parameters: 10.5M (20.01MiB)
[default5]:07/06/2024 09:37:31 [INFO|DP=0|PP=11|TP=1|ip-26-0-165-59]: [After model building] Memory usage: 21.02MiB. Peak allocated: 23.05MiB Peak reserved: 24.00MiB
[default5]:07/06/2024 09:37:31 [INFO|DP=0|PP=11|TP=1|ip-26-0-165-59]: No checkpoint path provided.
[default6]:07/06/2024 09:37:31 [INFO|DP=0|PP=7|TP=2|ip-26-0-163-147]: Local number of parameters: 21M (40.02MiB)
[default4]:07/06/2024 09:37:31 [INFO|DP=0|PP=15|TP=0|ip-26-0-174-36]: [After model building] Memory usage: 0.01MiB. Peak allocated: 0.03MiB Peak reserved: 2.00MiB
[default6]:07/06/2024 09:37:31 [INFO|DP=0|PP=15|TP=2|ip-26-0-174-36]: Local number of parameters: 0 (0.00MiB)
[default6]:07/06/2024 09:37:31 [INFO|DP=0|PP=15|TP=2|ip-26-0-174-36]: [After model building] Memory usage: 0.01MiB. Peak allocated: 0.03MiB Peak reserved: 2.00MiB
[default6]:07/06/2024 09:37:31 [INFO|DP=0|PP=15|TP=2|ip-26-0-174-36]: No checkpoint path provided.
[default4]:07/06/2024 09:37:31 [INFO|DP=0|PP=1|TP=0|ip-26-0-161-138]: [After model building] Memory usage: 42.03MiB. Peak allocated: 44.06MiB Peak reserved: 46.00MiB
[default4]:07/06/2024 09:37:31 [INFO|DP=0|PP=1|TP=0|ip-26-0-161-138]: No checkpoint path provided.
[default5]:07/06/2024 09:37:31 [INFO|DP=0|PP=5|TP=1|ip-26-0-163-134]: [After model building] Memory usage: 21.02MiB. Peak allocated: 23.05MiB Peak reserved: 24.00MiB
[default2]:07/06/2024 09:37:31 [INFO|DP=0|PP=12|TP=2|ip-26-0-173-246]: [After model building] Memory usage: 42.03MiB. Peak allocated: 44.06MiB Peak reserved: 46.00MiB
[default3]:07/06/2024 09:37:31 [INFO|DP=0|PP=12|TP=3|ip-26-0-173-246]: [After model building] Memory usage: 42.03MiB. Peak allocated: 44.06MiB Peak reserved: 46.00MiB
[default5]:07/06/2024 09:37:31 [INFO|DP=0|PP=9|TP=1|ip-26-0-165-131]: Local number of parameters: 21M (40.02MiB)
[default4]:07/06/2024 09:37:31 [INFO|DP=0|PP=11|TP=0|ip-26-0-165-59]: Local number of parameters: 10.5M (20.01MiB)
[default4]:07/06/2024 09:37:31 [INFO|DP=0|PP=11|TP=0|ip-26-0-165-59]: [After model building] Memory usage: 21.02MiB. Peak allocated: 23.05MiB Peak reserved: 24.00MiB
[default6]:07/06/2024 09:37:31 [INFO|DP=0|PP=7|TP=2|ip-26-0-163-147]: [After model building] Memory usage: 42.03MiB. Peak allocated: 44.06MiB Peak reserved: 46.00MiB
[default6]:07/06/2024 09:37:31 [INFO|DP=0|PP=7|TP=2|ip-26-0-163-147]: No checkpoint path provided.
[default4]:07/06/2024 09:37:31 [INFO|DP=0|PP=15|TP=0|ip-26-0-174-36]: No checkpoint path provided.
[default7]:07/06/2024 09:37:31 [INFO|DP=0|PP=1|TP=3|ip-26-0-161-138]: Local number of parameters: 21M (40.02MiB)
[default1]:07/06/2024 09:37:31 [INFO|DP=0|PP=4|TP=1|ip-26-0-163-134]: Local number of parameters: 21M (40.02MiB)
[default4]:07/06/2024 09:37:31 [INFO|DP=0|PP=5|TP=0|ip-26-0-163-134]: Local number of parameters: 10.5M (20.01MiB)
[default4]:07/06/2024 09:37:31 [INFO|DP=0|PP=5|TP=0|ip-26-0-163-134]: [After model building] Memory usage: 21.02MiB. Peak allocated: 23.05MiB Peak reserved: 24.00MiB
[default3]:07/06/2024 09:37:31 [INFO|DP=0|PP=12|TP=3|ip-26-0-173-246]: No checkpoint path provided.
[default5]:07/06/2024 09:37:31 [INFO|DP=0|PP=9|TP=1|ip-26-0-165-131]: [After model building] Memory usage: 42.03MiB. Peak allocated: 44.06MiB Peak reserved: 46.00MiB
[default5]:07/06/2024 09:37:31 [INFO|DP=0|PP=9|TP=1|ip-26-0-165-131]: No checkpoint path provided.
[default4]:07/06/2024 09:37:31 [INFO|DP=0|PP=11|TP=0|ip-26-0-165-59]: No checkpoint path provided.
[default1]:07/06/2024 09:37:31 [INFO|DP=0|PP=6|TP=1|ip-26-0-163-147]: Local number of parameters: 21M (40.02MiB)
[default1]:07/06/2024 09:37:31 [INFO|DP=0|PP=14|TP=1|ip-26-0-174-36]: Local number of parameters: 25.7M (49.09MiB)
[default7]:07/06/2024 09:37:31 [INFO|DP=0|PP=1|TP=3|ip-26-0-161-138]: [After model building] Memory usage: 42.03MiB. Peak allocated: 44.06MiB Peak reserved: 46.00MiB
[default5]:07/06/2024 09:37:31 [INFO|DP=0|PP=1|TP=1|ip-26-0-161-138]: Local number of parameters: 21M (40.02MiB)
[default1]:07/06/2024 09:37:31 [INFO|DP=0|PP=4|TP=1|ip-26-0-163-134]: [After model building] Memory usage: 42.03MiB. Peak allocated: 44.06MiB Peak reserved: 46.00MiB
[default2]:07/06/2024 09:37:31 [INFO|DP=0|PP=12|TP=2|ip-26-0-173-246]: No checkpoint path provided.
[default6]:07/06/2024 09:37:31 [INFO|DP=0|PP=3|TP=2|ip-26-0-161-142]: Local number of parameters: 21M (40.02MiB)
[default6]:07/06/2024 09:37:31 [INFO|DP=0|PP=3|TP=2|ip-26-0-161-142]: [After model building] Memory usage: 42.03MiB. Peak allocated: 44.06MiB Peak reserved: 46.00MiB
[default6]:07/06/2024 09:37:31 [INFO|DP=0|PP=3|TP=2|ip-26-0-161-142]: No checkpoint path provided.
[default1]:07/06/2024 09:37:31 [INFO|DP=0|PP=6|TP=1|ip-26-0-163-147]: [After model building] Memory usage: 42.03MiB. Peak allocated: 44.06MiB Peak reserved: 46.00MiB
[default1]:07/06/2024 09:37:31 [INFO|DP=0|PP=14|TP=1|ip-26-0-174-36]: [After model building] Memory usage: 50.01MiB. Peak allocated: 50.03MiB Peak reserved: 52.00MiB
[default1]:07/06/2024 09:37:31 [INFO|DP=0|PP=14|TP=1|ip-26-0-174-36]: No checkpoint path provided.
[default5]:07/06/2024 09:37:31 [INFO|DP=0|PP=1|TP=1|ip-26-0-161-138]: [After model building] Memory usage: 42.03MiB. Peak allocated: 44.06MiB Peak reserved: 46.00MiB
[default5]:07/06/2024 09:37:31 [INFO|DP=0|PP=1|TP=1|ip-26-0-161-138]: No checkpoint path provided.
[default5]:07/06/2024 09:37:31 [INFO|DP=0|PP=5|TP=1|ip-26-0-163-134]: No checkpoint path provided.
[default6]:07/06/2024 09:37:31 [INFO|DP=0|PP=13|TP=2|ip-26-0-173-246]: Local number of parameters: 21M (40.02MiB)
[default5]:07/06/2024 09:37:31 [INFO|DP=0|PP=3|TP=1|ip-26-0-161-142]: Local number of parameters: 21M (40.02MiB)
[default4]:07/06/2024 09:37:31 [INFO|DP=0|PP=7|TP=0|ip-26-0-163-147]: Local number of parameters: 21M (40.02MiB)
[default5]:07/06/2024 09:37:31 [INFO|DP=0|PP=15|TP=1|ip-26-0-174-36]: Local number of parameters: 0 (0.00MiB)
[default7]:07/06/2024 09:37:31 [INFO|DP=0|PP=1|TP=3|ip-26-0-161-138]: No checkpoint path provided.
[default4]:07/06/2024 09:37:31 [INFO|DP=0|PP=5|TP=0|ip-26-0-163-134]: No checkpoint path provided.
[default6]:07/06/2024 09:37:31 [INFO|DP=0|PP=13|TP=2|ip-26-0-173-246]: [After model building] Memory usage: 42.03MiB. Peak allocated: 44.06MiB Peak reserved: 46.00MiB
[default5]:07/06/2024 09:37:31 [INFO|DP=0|PP=3|TP=1|ip-26-0-161-142]: [After model building] Memory usage: 42.03MiB. Peak allocated: 44.06MiB Peak reserved: 46.00MiB
[default7]:07/06/2024 09:37:31 [INFO|DP=0|PP=3|TP=3|ip-26-0-161-142]: Local number of parameters: 21M (40.02MiB)
[default7]:07/06/2024 09:37:31 [INFO|DP=0|PP=3|TP=3|ip-26-0-161-142]: [After model building] Memory usage: 42.03MiB. Peak allocated: 44.06MiB Peak reserved: 46.00MiB
[default7]:07/06/2024 09:37:31 [INFO|DP=0|PP=3|TP=3|ip-26-0-161-142]: No checkpoint path provided.
[default5]:07/06/2024 09:37:31 [INFO|DP=0|PP=3|TP=1|ip-26-0-161-142]: No checkpoint path provided.
[default5]:07/06/2024 09:37:31 [INFO|DP=0|PP=7|TP=1|ip-26-0-163-147]: Local number of parameters: 21M (40.02MiB)
[default5]:07/06/2024 09:37:31 [INFO|DP=0|PP=7|TP=1|ip-26-0-163-147]: [After model building] Memory usage: 42.03MiB. Peak allocated: 44.06MiB Peak reserved: 46.00MiB
[default2]:07/06/2024 09:37:31 [INFO|DP=0|PP=14|TP=2|ip-26-0-174-36]: Local number of parameters: 25.7M (49.09MiB)
[default1]:07/06/2024 09:37:31 [INFO|DP=0|PP=4|TP=1|ip-26-0-163-134]: No checkpoint path provided.
[default6]:07/06/2024 09:37:31 [INFO|DP=0|PP=13|TP=2|ip-26-0-173-246]: No checkpoint path provided.
[default4]:07/06/2024 09:37:31 [INFO|DP=0|PP=7|TP=0|ip-26-0-163-147]: [After model building] Memory usage: 42.03MiB. Peak allocated: 44.06MiB Peak reserved: 46.00MiB
[default4]:07/06/2024 09:37:31 [INFO|DP=0|PP=7|TP=0|ip-26-0-163-147]: No checkpoint path provided.
[default2]:07/06/2024 09:37:31 [INFO|DP=0|PP=14|TP=2|ip-26-0-174-36]: [After model building] Memory usage: 50.01MiB. Peak allocated: 50.03MiB Peak reserved: 52.00MiB
[default2]:07/06/2024 09:37:31 [INFO|DP=0|PP=4|TP=2|ip-26-0-163-134]: Local number of parameters: 21M (40.02MiB)
[default2]:07/06/2024 09:37:31 [INFO|DP=0|PP=4|TP=2|ip-26-0-163-134]: [After model building] Memory usage: 42.03MiB. Peak allocated: 44.06MiB Peak reserved: 46.00MiB
[default7]:07/06/2024 09:37:31 [INFO|DP=0|PP=13|TP=3|ip-26-0-173-246]: Local number of parameters: 21M (40.02MiB)
[default1]:07/06/2024 09:37:31 [INFO|DP=0|PP=6|TP=1|ip-26-0-163-147]: No checkpoint path provided.
[default3]:07/06/2024 09:37:31 [INFO|DP=0|PP=14|TP=3|ip-26-0-174-36]: Local number of parameters: 25.7M (49.09MiB)
[default5]:07/06/2024 09:37:31 [INFO|DP=0|PP=15|TP=1|ip-26-0-174-36]: [After model building] Memory usage: 0.01MiB. Peak allocated: 0.03MiB Peak reserved: 2.00MiB
[default2]:07/06/2024 09:37:31 [INFO|DP=0|PP=4|TP=2|ip-26-0-163-134]: No checkpoint path provided.
[default7]:07/06/2024 09:37:31 [INFO|DP=0|PP=13|TP=3|ip-26-0-173-246]: [After model building] Memory usage: 42.03MiB. Peak allocated: 44.06MiB Peak reserved: 46.00MiB
[default7]:07/06/2024 09:37:31 [INFO|DP=0|PP=13|TP=3|ip-26-0-173-246]: No checkpoint path provided.
[default5]:07/06/2024 09:37:31 [INFO|DP=0|PP=7|TP=1|ip-26-0-163-147]: No checkpoint path provided.
[default3]:07/06/2024 09:37:31 [INFO|DP=0|PP=14|TP=3|ip-26-0-174-36]: [After model building] Memory usage: 50.01MiB. Peak allocated: 50.03MiB Peak reserved: 52.00MiB
[default5]:07/06/2024 09:37:31 [INFO|DP=0|PP=15|TP=1|ip-26-0-174-36]: No checkpoint path provided.
[default2]:07/06/2024 09:37:31 [INFO|DP=0|PP=14|TP=2|ip-26-0-174-36]: No checkpoint path provided.
[default7]:07/06/2024 09:37:31 [INFO|DP=0|PP=5|TP=3|ip-26-0-163-134]: Local number of parameters: 10.5M (20.01MiB)
[default3]:07/06/2024 09:37:31 [INFO|DP=0|PP=14|TP=3|ip-26-0-174-36]: No checkpoint path provided.
[default7]:07/06/2024 09:37:31 [INFO|DP=0|PP=5|TP=3|ip-26-0-163-134]: [After model building] Memory usage: 21.02MiB. Peak allocated: 23.05MiB Peak reserved: 24.00MiB
[default7]:07/06/2024 09:37:31 [INFO|DP=0|PP=5|TP=3|ip-26-0-163-134]: No checkpoint path provided.
[default3]:07/06/2024 09:37:31 [INFO|DP=0|PP=4|TP=3|ip-26-0-163-134]: Local number of parameters: 21M (40.02MiB) [default3]:07/06/2024 09:37:31 [INFO|DP=0|PP=4|TP=3|ip-26-0-163-134]: [After model building] Memory usage: 42.03MiB. Peak allocated: 44.06MiB Peak reserved: 46.00MiB [default6]:07/06/2024 09:37:31 [INFO|DP=0|PP=5|TP=2|ip-26-0-163-134]: Local number of parameters: 10.5M (20.01MiB) [default6]:07/06/2024 09:37:31 [INFO|DP=0|PP=5|TP=2|ip-26-0-163-134]: [After model building] Memory usage: 21.02MiB. Peak allocated: 23.05MiB Peak reserved: 24.00MiB [default6]:07/06/2024 09:37:31 [INFO|DP=0|PP=5|TP=2|ip-26-0-163-134]: No checkpoint path provided. [default0]:07/06/2024 09:37:31 [INFO|DP=0|PP=2|TP=0|ip-26-0-161-142]: Local number of parameters: 10.5M (20.01MiB) [default0]:07/06/2024 09:37:31 [INFO|DP=0|PP=2|TP=0|ip-26-0-161-142]: [After model building] Memory usage: 21.02MiB. Peak allocated: 23.05MiB Peak reserved: 24.00MiB [default3]:07/06/2024 09:37:31 [INFO|DP=0|PP=4|TP=3|ip-26-0-163-134]: No checkpoint path provided. [default0]:07/06/2024 09:37:31 [INFO|DP=0|PP=2|TP=0|ip-26-0-161-142]: No checkpoint path provided. [default1]:07/06/2024 09:37:31 [INFO|DP=0|PP=2|TP=1|ip-26-0-161-142]: Local number of parameters: 10.5M (20.01MiB) [default1]:07/06/2024 09:37:31 [INFO|DP=0|PP=2|TP=1|ip-26-0-161-142]: [After model building] Memory usage: 21.02MiB. Peak allocated: 23.05MiB Peak reserved: 24.00MiB [default1]:07/06/2024 09:37:31 [INFO|DP=0|PP=2|TP=1|ip-26-0-161-142]: No checkpoint path provided. [default3]:07/06/2024 09:37:31 [INFO|DP=0|PP=2|TP=3|ip-26-0-161-142]: Local number of parameters: 10.5M (20.01MiB) [default3]:07/06/2024 09:37:31 [INFO|DP=0|PP=2|TP=3|ip-26-0-161-142]: [After model building] Memory usage: 21.02MiB. Peak allocated: 23.05MiB Peak reserved: 24.00MiB [default3]:07/06/2024 09:37:31 [INFO|DP=0|PP=2|TP=3|ip-26-0-161-142]: No checkpoint path provided. 
[default4]:07/06/2024 09:37:31 [INFO|DP=0|PP=3|TP=0|ip-26-0-161-142]: Local number of parameters: 21M (40.02MiB) [default4]:07/06/2024 09:37:31 [INFO|DP=0|PP=3|TP=0|ip-26-0-161-142]: [After model building] Memory usage: 42.03MiB. Peak allocated: 44.06MiB Peak reserved: 46.00MiB [default4]:07/06/2024 09:37:31 [INFO|DP=0|PP=3|TP=0|ip-26-0-161-142]: No checkpoint path provided. [default2]:07/06/2024 09:37:31 [INFO|DP=0|PP=2|TP=2|ip-26-0-161-142]: Local number of parameters: 10.5M (20.01MiB) [default2]:07/06/2024 09:37:31 [INFO|DP=0|PP=2|TP=2|ip-26-0-161-142]: [After model building] Memory usage: 21.02MiB. Peak allocated: 23.05MiB Peak reserved: 24.00MiB [default2]:07/06/2024 09:37:31 [INFO|DP=0|PP=2|TP=2|ip-26-0-161-142]: No checkpoint path provided. [default0]:07/06/2024 09:37:33 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-138]: [Optimizer Building] Using LearningRateForSP as learning rate [default0]:07/06/2024 09:37:33 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-138]: [ZeRO sharding] Size of optimizer params per rank: [default0]:07/06/2024 09:37:33 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-138]: [ZeRO sharding] DP Rank 0 has 46.7M out of 46.7M (100.00%) params' optimizer states [default0]:07/06/2024 09:37:34 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-138]: [Training Plan] Stage Training Stage has 19 remaining training steps and has consumed 0 samples [default0]:07/06/2024 09:37:34 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-138]: Using `datasets` library [default0]:07/06/2024 09:37:34 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-138]: Loading tokenizer from openai-community/gpt2 and transformers/hf_hub versions ('4.41.2', '0.23.4') [default0]:07/06/2024 09:37:34 [WARNING|DP=0|PP=0|TP=0|ip-26-0-161-138]: Repo card metadata block was not found. Setting CardData to empty. [default0]:Repo card metadata block was not found. Setting CardData to empty. 
[default0]:07/06/2024 09:37:34 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-138]: [Training Plan] There are 1 training stages [default0]:07/06/2024 09:37:34 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-138]: [Stage Training Stage] start from step 1 [default0]:07/06/2024 09:37:34 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-138]: [default0]:07/06/2024 09:37:34 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-138]: [Start training] datetime: 2024-07-06 09:37:34.939533 | mbs: 16 | grad_accum: 64 | global_batch_size: 1024 | sequence_length: 4096 | train_steps: 20 | start_iteration_step: 0 | consumed_train_samples: 0 [default1]:07/06/2024 09:37:35 [WARNING|DP=0|PP=10|TP=1|ip-26-0-165-59]: Repo card metadata block was not found. Setting CardData to empty. [default4]:Repo card metadata block was not found. Setting CardData to empty. [default1]:07/06/2024 09:37:35 [WARNING|DP=0|PP=0|TP=1|ip-26-0-161-138]: Repo card metadata block was not found. Setting CardData to empty. [default4]:Repo card metadata block was not found. Setting CardData to empty. [default1]:Repo card metadata block was not found. Setting CardData to empty. [default2]:Repo card metadata block was not found. Setting CardData to empty. [default6]:07/06/2024 09:37:35 [WARNING|DP=0|PP=9|TP=2|ip-26-0-165-131]: Repo card metadata block was not found. Setting CardData to empty. [default0]:07/06/2024 09:37:35 [WARNING|DP=0|PP=12|TP=0|ip-26-0-173-246]: Repo card metadata block was not found. Setting CardData to empty. [default3]:Repo card metadata block was not found. Setting CardData to empty. [default4]:07/06/2024 09:37:35 [WARNING|DP=0|PP=13|TP=0|ip-26-0-173-246]: Repo card metadata block was not found. Setting CardData to empty. [default1]:Repo card metadata block was not found. Setting CardData to empty. [default6]:Repo card metadata block was not found. Setting CardData to empty. [default0]:07/06/2024 09:37:35 [WARNING|DP=0|PP=6|TP=0|ip-26-0-163-147]: Repo card metadata block was not found. Setting CardData to empty. 
[default5]:07/06/2024 09:37:35 [WARNING|DP=0|PP=1|TP=1|ip-26-0-161-138]: Repo card metadata block was not found. Setting CardData to empty. [default2]:Repo card metadata block was not found. Setting CardData to empty. [default3]:07/06/2024 09:37:35 [WARNING|DP=0|PP=14|TP=3|ip-26-0-174-36]: Repo card metadata block was not found. Setting CardData to empty. [default3]:Repo card metadata block was not found. Setting CardData to empty. [default6]:07/06/2024 09:37:40 [WARNING|DP=0|PP=11|TP=2|ip-26-0-165-59]: Repo card metadata block was not found. Setting CardData to empty. [default6]:Repo card metadata block was not found. Setting CardData to empty. [default0]:07/06/2024 09:37:41 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-138]: Resuming training from stage Training Stage, it has trained for 0 samples and has 19 remaining train steps [default0]:07/06/2024 09:37:41 [INFO|DP=0|PP=0|TP=0|ip-26-0-161-138]: Memory usage: 448.42MiB. Peak allocated 448.42MiB. Peak reserved: 456.00MiB [default0]:STAGE:2024-07-06 09:37:41 1423316:1423316 ActivityProfilerController.cpp:314] Completed Stage: Warm Up [default0]:STAGE:2024-07-06 09:37:41 1423316:1423316 ActivityProfilerController.cpp:320] Completed Stage: Collection [default0]:STAGE:2024-07-06 09:37:41 1423316:1423316 ActivityProfilerController.cpp:324] Completed Stage: Post Processing [default0]:Traceback (most recent call last): [default0]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 237, in [default0]: trainer.train(dataloader) [default0]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 430, in train [default0]: outputs, loss_avg = self.training_step(dataloader=self.current_dataloader) [default0]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 459, in training_step [default0]: outputs = self.pipeline_engine.train_batch_iter( [default0]: File 
"/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 187, in train_batch_iter [default0]: output = self.forward(context=context, state=state, micro_batch=micro_batch, model=model) [default0]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 44, in forward [default0]: output = model(**micro_batch) [default0]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl [default0]: return self._call_impl(*args, **kwargs) [default0]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl [default0]: return forward_call(*args, **kwargs) [default0]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 890, in forward [default0]: sharded_logits = self.model( [default0]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl [default0]: return self._call_impl(*args, **kwargs) [default0]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl [default0]: return forward_call(*args, **kwargs) [default0]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 764, in forward [default0]: return self.forward_with_hidden_states(input_ids=input_ids, input_mask=input_mask)[0] [default0]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 780, in forward_with_hidden_states [default0]: hidden_encoder_states = encoder_block(**hidden_encoder_states) [default0]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, 
in _wrapped_call_impl [default0]: return self._call_impl(*args, **kwargs) [default0]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl [default0]: return forward_call(*args, **kwargs) [default0]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/block.py", line 151, in forward [default0]: output = self.pp_block(**new_kwargs) [default0]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl [default0]: return self._call_impl(*args, **kwargs) [default0]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl [default0]: return forward_call(*args, **kwargs) [default0]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 628, in forward [default0]: hidden_states = self.input_layernorm(hidden_states) [default0]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl [default0]: return self._call_impl(*args, **kwargs) [default0]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl [default0]: return forward_call(*args, **kwargs) [default0]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/nn/layer_norm.py", line 42, in forward [default0]: return layer_norm_fn( [default0]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/flash_attn/ops/triton/layer_norm.py", line 875, in layer_norm_fn [default0]: return LayerNormFn.apply( [default0]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/autograd/function.py", line 553, in apply 
[default0]: return super().apply(*args, **kwargs) # type: ignore[misc] [default0]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/flash_attn/ops/triton/layer_norm.py", line 748, in forward [default0]: y, y1, mean, rstd, residual_out, seeds, dropout_mask, dropout_mask1 = _layer_norm_fwd( [default0]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/flash_attn/ops/triton/layer_norm.py", line 335, in _layer_norm_fwd [default0]: _layer_norm_fwd_1pass_kernel[(M,)]( [default0]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/triton/runtime/autotuner.py", line 143, in run [default0]: timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} [default0]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/triton/runtime/autotuner.py", line 143, in [default0]: timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} [default0]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/triton/runtime/autotuner.py", line 122, in _bench [default0]: return do_bench(kernel_call, warmup=self.warmup, rep=self.rep, quantiles=(0.5, 0.2, 0.8)) [default0]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/triton/testing.py", line 102, in do_bench [default0]: fn() [default0]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/triton/runtime/autotuner.py", line 110, in kernel_call [default0]: self.fn.run( [default0]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/triton/runtime/autotuner.py", line 305, in run [default0]: return self.fn.run(*args, **kwargs) [default0]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/triton/runtime/autotuner.py", line 305, in run [default0]: return 
self.fn.run(*args, **kwargs) [default0]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/triton/runtime/autotuner.py", line 305, in run [default0]: return self.fn.run(*args, **kwargs) [default0]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/triton/runtime/jit.py", line 532, in run [default0]: self.cache[device][key] = compile( [default0]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/triton/compiler/compiler.py", line 503, in compile [default0]: metadata_group = fn_cache_manager.get_group(metadata_filename) or {} [default0]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/triton/runtime/cache.py", line 89, in get_group [default0]: with open(grp_filepath) as f: [default0]:FileNotFoundError: [Errno 2] No such file or directory: '/admin/home/ferdinand_mom/.triton/cache/1df5929dd2d869dace90696537f02efe/__grp___layer_norm_fwd_1pass_kernel.json' [default4]:Traceback (most recent call last): [default4]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 237, in [default4]: trainer.train(dataloader) [default4]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 430, in train [default4]: outputs, loss_avg = self.training_step(dataloader=self.current_dataloader) [default4]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 459, in training_step [default4]: outputs = self.pipeline_engine.train_batch_iter( [default4]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 187, in train_batch_iter [default4]: output = self.forward(context=context, state=state, micro_batch=micro_batch, model=model) [default4]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 44, in forward [default4]: 
output = model(**micro_batch) [default4]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl [default4]: return self._call_impl(*args, **kwargs) [default4]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl [default4]: return forward_call(*args, **kwargs) [default4]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 890, in forward [default4]: sharded_logits = self.model( [default4]:Traceback (most recent call last): [default4]:Traceback (most recent call last): [default4]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 237, in [default4]: trainer.train(dataloader) [default4]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 430, in train [default4]: outputs, loss_avg = self.training_step(dataloader=self.current_dataloader) [default4]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 459, in training_step [default4]: outputs = self.pipeline_engine.train_batch_iter( [default4]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 187, in train_batch_iter [default4]: output = self.forward(context=context, state=state, micro_batch=micro_batch, model=model) [default3]:Traceback (most recent call last): [default3]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 237, in [default3]: trainer.train(dataloader) [default3]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 430, in train [default3]: outputs, loss_avg = self.training_step(dataloader=self.current_dataloader) [default3]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 459, in training_step 
[default3]: outputs = self.pipeline_engine.train_batch_iter( [default2]:Traceback (most recent call last): [default2]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 237, in [default2]: trainer.train(dataloader) [default4]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl [default4]: return self._call_impl(*args, **kwargs) [default4]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl [default4]: return forward_call(*args, **kwargs) [default4]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 764, in forward [default4]: return self.forward_with_hidden_states(input_ids=input_ids, input_mask=input_mask)[0] [default4]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 780, in forward_with_hidden_states [default4]: hidden_encoder_states = encoder_block(**hidden_encoder_states) [default4]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_c[default4]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 237, in [default4]: trainer.train(dataloader) [default4]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 430, in train [default4]: outputs, loss_avg = self.training_step(dataloader=self.current_dataloader) [default4]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 459, in training_step [default4]: outputs = self.pipeline_engine.train_batch_iter( [default4]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 44, in forward [default4]: output = model(**micro_batch) [default4]: File 
"/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl [default4]: return self._call_impl(*args, **kwargs) [default4]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl [default4]: return forward_call(*args, **kwargs) [default4]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 890, in forward [default4]: sharded_logits = self.model( [default4]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl [default4]: return self._call_impl(*args, **kwargs) [default4]: File "/fsx/f[default1]:Traceback (most recent call last): [default3]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 187, in train_batch_iter [default3]: output = self.forward(context=context, state=state, micro_batch=micro_batch, model=model) [default3]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 44, in forward [default1]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 237, in [default1]: trainer.train(dataloader) [default1]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 430, in train [default2]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 430, in train [default2]: outputs, loss_avg = self.training_step(dataloader=self.current_dataloader) [default1]:Traceback (most recent call last): all_impl [default4]: return self._call_impl(*args, **kwargs) [default4]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl [default4]: return 
forward_call(*args, **kwargs) [default4]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/block.py", line 126, in forward [default4]: new_kwargs[name] = recv_from_pipeline_state_buffer( [default4]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/functional.py", line 117, in recv_from_pipeline_state_buffer [default4]: pipeline_state.run_communication() [default4]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 150, in run_communication [default4]: recv_activation_tensor = recv_activation() [default4]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_p[default4]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 187, in train_batch_iter [default4]: output = self.forward(context=context, state=state, micro_batch=micro_batch, model=model) [default4]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 44, in forward [default4]: output = model(**micro_batch) [default4]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl [default4]: return self._call_impl(*args, **kwargs) [default4]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl erdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl [default4]: return forward_call(*args, **kwargs) [default4]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 764, in forward [default4]: return self.forward_with_hidden_states(input_ids=input_ids, input_mask=input_mask)[0] [default1]: outputs, 
loss_avg = self.training_step(dataloader=self.current_dataloader) [default1]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 459, in training_step [default1]: outputs = self.pipeline_engine.train_batch_iter( [default1]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 187, in train_batch_iter [default1]: output = self.forward(context=context, state=state, micro_batch=micro_batch, model=model) [default3]: output = model(**micro_batch) [default1]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 44, in forward [default1]: output = model(**micro_batch) [default1]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl [default1]: return self._call_impl(*args, **kwargs) [default1]: File "/fsx/ferdinan[default1]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 237, in [default1]: trainer.train(dataloader) [default1]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 430, in train [default1]: outputs, loss_avg = self.training_step(dataloader=self.current_dataloader) [default1]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 459, in training_step [default1]: outputs = self.pipeline_engine.train_batch_iter( [default1]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 187, in train_batch_iter [default1]: output = self.forward(context=context, state=state, micro_batch=micro_batch, model=model) arallel/state.py", line 31, in __call__ [default4]: return self.p2p.recv_tensors(num_tensors=1, from_rank=self.from_rank)[0] [default4]: File 
"/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 353, in recv_tensors [default4]: buffers, futures = self.irecv_tensors(num_tensors=num_tensors, from_rank=from_rank, tag=tag) [default4]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 326, in irecv_tensors [default4]: meta = self._recv_meta(from_rank=from_rank, tag=tag) [default4]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 246, in _recv_meta [default4]: dist.recv( [default4]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 72, in wrapper [default4]: return func(*args, **kwargs) [default4]: File "/fsx/ferdinandmom/miniforge3/envs[default4]: return forward_call(*args, **kwargs) [default4]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 890, in forward [default4]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 780, in forward_with_hidden_states [default4]: hidden_encoder_states = encoder_block(**hidden_encoder_states) [default4]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl [default4]: return self._call_impl(*args, **kwargs) [default4]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl [default4]: return forward_call(*args, **kwargs) [default4]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/block.py", line 126, in forward [default4]: new_kwargs[name] = recv_from_pipeline_state_buffer( [default4]: File 
"/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/functional.py", line 117, in recv_from_pipeline_state_dmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl [default1]: return forward_call(*args, **kwargs) [default2]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 459, in training_step [default2]: outputs = self.pipeline_engine.train_batch_iter( /env-bench-cluster/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1706, in recv [default4]: pg.recv([tensor], group_src_rank, tag).wait() [default4]:torch.distributed.DistBackendError: [11] is setting up NCCL communicator and retrieving ncclUniqueId from [0] via c10d key-value store by key '10:11', but store->get('10:11') got error: Connection reset by peer [default4]: sharded_logits = self.model( [default0]:Traceback (most recent call last): [default0]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 237, in [default0]: trainer.train(dataloader) [default1]:Traceback (most recent call last): buffer [default4]: pipeline_state.run_communication() [default4]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 150, in run_communication [default4]: recv_activation_tensor = recv_activation() [default4]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 31, in __call__ [default4]: return self.p2p.recv_tensors(num_tensors=1, from_rank=self.from_rank)[0] [default4]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 353, in recv_tensors [default4]: buffers, futures = self.irecv_tensors(num_tensors=num_tensors, from_rank=from_rank, tag=tag) [default4]: File 
"/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 326, in irecv_tensors [default4]: meta = self._recv_meta(from_rank=from_rank, tag=tag) [default4]: File "/fsx/ferdinandmom/ferdinand-h[default1]:Traceback (most recent call last): [default1]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 237, in [default1]: trainer.train(dataloader) [default1]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 430, in train [default1]: outputs, loss_avg = self.training_step(dataloader=self.current_dataloader) [default1]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 459, in training_step [default1]: outputs = self.pipeline_engine.train_batch_iter( [default1]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 187, in train_batch_iter [default1]: output = self.forward(context=context, state=state, micro_batch=micro_batch, model=model) [default1]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 44, in forward [default1]: output = [default1]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 890, in forward [default1]: sharded_logits = self.model( [default1]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl [default2]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 187, in train_batch_iter [default1]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 44, in forward [default4]:Exception raised from recvBytes at ../torch/csrc/distributed/c10d/Utils.hpp:670 (most recent call first): [default4]:frame #0: 
c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7fc507dbdd87 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default4]:frame #1: + 0x589518e (0x7fc53fd7718e in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default4]:frame #2: c10d::TCPStore::doWait(c10::ArrayRef, std::chrono::duration >) + 0x360 (0x7fc53fd719a0 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default4]:frame #3: c10d::TCPStore::doGet(std::string const&) + 0x32 (0x7fc53fd71ce2 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default4]:frame #4: c10d::TCPStore::get(std::string const&) + 0xa1 (0[default1]:Traceback (most recent call last): [default1]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 237, in [default0]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 430, in train [default0]: outputs, loss_avg = self.training_step(dataloader=self.current_dataloader) [default0]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 459, in training_step [default1]: trainer.train(dataloader) f/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 246, in _recv_meta [default4]: dist.recv( [default4]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 72, in wrapper [default4]: return func(*args, **kwargs) [default4]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1706, in recv [default4]: pg.recv([tensor], group_src_rank, tag).wait() model(**micro_batch) [default1]: File 
"/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl [default1]: return self._call_impl(*args, **kwargs) [default1]: return self._call_impl(*args, **kwargs) [default1]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl [default2]: output = self.forward(context=context, state=state, micro_batch=micro_batch, model=model) [default2]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 44, in forward x7fc53fd72b11 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default4]:frame #5: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7fc53fd27f81 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default4]:frame #6: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7fc53fd27f81 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default4]:frame #7: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7fc53fd27f81 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default4]:frame #8: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7fc53fd27f81 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default4]:frame #9: c10d::ProcessGroupNCCL::broadcastUniqueNCCLID(ncclUniqueId*, bool, std::string const&,[default1]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 237, in [default6]:Traceback (most recent call last): [default4]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl [default0]: outputs = 
self.pipeline_engine.train_batch_iter( [default1]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 430, in train [default0]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 187, in train_batch_iter [default0]: output = self.forward(context=context, state=state, micro_batch=micro_batch, model=model) [default0]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 44, in forward [default0]: output = model(**micro_batch) [default4]:torch.distributed.DistBackendError: [1] is setting up NCCL communicator and retrieving ncclUniqueId from [0] via c10d key-value store by key '0:1', but store->get('0:1') got error: Connection reset by peer [default4]:Exception raised from recvBytes at ../torch/csrc/distributed/c10d/Utils.hpp:670 (most recent call first): [default4]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7fc717dfdd87 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default4]:frame #1: + 0x589518e (0x7fc74fdb718e in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default4]:frame #2: c10d::TCPStore::doWait(c10::ArrayRef, std::chrono::duration >) + 0x360 (0x7fc74fdb19a0 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default4]:frame #3: c10d::TCPStore::doGet(std::st[default1]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl [default1]: return forward_call(*args, **kwargs) [default1]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 890, in forward [default1]: sharded_logits = self.model( [default1]: File 
"/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl [default1]: return self._call_impl(*args, **kwargs) [default1]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl [default1]: return forward_call(*args, **kwargs) [default1]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 764, in forward [default1]: return self.forward_with_hidden_states(input_ids=input_ids, input_mask=input_mask)[0] [default3]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl [default2]: output = model(**micro_batch) [default1]:Traceback (most recent call last): int) + 0xa9 (0x7fc508f65c69 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default4]:frame #10: c10d::ProcessGroupNCCL::getNCCLComm(std::string const&, std::vector > const&, c10d::OpType, int, bool) + 0x22b (0x7fc508f6cc5b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default4]:frame #11: c10d::ProcessGroupNCCL::recv(std::vector >&, int, int) + 0x550 (0x7fc508f8fb60 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default4]:frame #12: + 0x5838439 (0x7fc53fd1a439 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default4]:frame #13: + 0x5843330 (0x7fc53fd25330 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packa[default1]: trainer.train(dataloader) [default1]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 430, in train [default0]: File 
"/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl [default0]: return self._call_impl(*args, **kwargs) [default0]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl [default0]: return forward_call(*args, **kwargs) [default0]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 890, in forward [default1]: outputs, loss_avg = self.training_step(dataloader=self.current_dataloader) [default0]: sharded_logits = self.model( ring const&) + 0x32 (0x7fc74fdb1ce2 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default4]:frame #4: c10d::TCPStore::get(std::string const&) + 0xa1 (0x7fc74fdb2b11 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default4]:frame #5: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7fc74fd67f81 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default4]:frame #6: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7fc74fd67f81 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default4]:frame #7: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7fc74fd67f81 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default4]:frame #8: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7fc74fd67f81 i[default1]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 780, in forward_with_hidden_states [default1]: hidden_encoder_states = encoder_block(**hidden_encoder_states) [default1]: File 
"/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl [default1]: return self._call_impl(*args, **kwargs) [default1]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl [default1]: return forward_call(*args, **kwargs) [default1]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/block.py", line 126, in forward [default1]: new_kwargs[name] = recv_from_pipeline_state_buffer( [default1]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/functional.py", line 117, in recv_from_pipeline_state_[default1]: return forward_call(*args, **kwargs) [default2]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl [default2]: return self._call_impl(*args, **kwargs) [default1]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 237, in [default1]: trainer.train(dataloader) [default1]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 430, in train ges/torch/lib/libtorch_cpu.so) [default4]:frame #14: + 0x58433c5 (0x7fc53fd253c5 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default4]:frame #15: + 0x4e893cc (0x7fc53f36b3cc in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default4]:frame #16: + 0x1a08a88 (0x7fc53beeaa88 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default4]:frame #17: + 0x5849a84 (0x7fc53fd2ba84 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default4]:frame #18: + 0x584ed35 
(0x7fc53fd30d35 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default4]:frame #19: + 0xc97eee (0x7fc5525e2eee in /fsx/ferdinandmom/mini[default4]: return self._call_impl(*args, **kwargs) [default1]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 459, in training_step n /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default4]:frame #9: c10d::ProcessGroupNCCL::broadcastUniqueNCCLID(ncclUniqueId*, bool, std::string const&, int) + 0xa9 (0x7fc718fa5c69 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default4]:frame #10: c10d::ProcessGroupNCCL::getNCCLComm(std::string const&, std::vector > const&, c10d::OpType, int, bool) + 0x22b (0x7fc718facc5b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default4]:frame #11: c10d::ProcessGroupNCCL::recv(std::vector >&, int, int) + 0x550 (0x7fc718fcfb60 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default4]:frame #12: + 0x5838439 (0x7fc74fd5a439 in /fsx/ferdinandmom/miniforge3/envs/env-bebuffer [default1]: pipeline_state.run_communication() [default1]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 150, in run_communication [default1]: recv_activation_tensor = recv_activation() [default1]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 31, in __call__ [default1]: return self.p2p.recv_tensors(num_tensors=1, from_rank=self.from_rank)[0] [default1]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 353, in recv_tensors [default1]: buffers, futures = 
self.irecv_tensors(num_tensors=num_tensors, from_rank=from_rank, tag=tag) [default1]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 326, in irecv_tensors [default1]: meta = self._recv_meta(from_rank=from_rank, tag=tag) [default1]: File "/fsx/ferdinandmom/ferdinand-h[default3]: return self._call_impl(*args, **kwargs) [default2]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl [default2]:Traceback (most recent call last): forge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_python.so) [default4]:frame #20: + 0x413ea4 (0x7fc551d5eea4 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_python.so) [default4]:frame #21: + 0x1445a6 (0x55a6b41185a6 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:frame #22: _PyObject_MakeTpCall + 0x26b (0x55a6b4111a6b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:frame #23: + 0x150866 (0x55a6b4124866 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:frame #24: _PyEval_EvalFrameDefault + 0x4c12 (0x55a6b410d142 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:frame #25: _PyFunction_Vectorcall + 0x6c (0x55a6b4118a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:frame #26: PyObject_Call + 0xbc (0x55a6b4124[default4]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl [default1]: outputs = self.pipeline_engine.train_batch_iter( [default1]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 187, in train_batch_iter nch-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 
f/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 246, in _recv_meta [default1]: dist.recv( [default1]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 72, in wrapper [default1]: return func(*args, **kwargs) [default1]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 764, in forward [default1]: return self.forward_with_hidden_states(input_ids=input_ids, input_mask=input_mask)[0] [default2]: return forward_call(*args, **kwargs) [default1]: outputs, loss_avg = self.training_step(dataloader=self.current_dataloader) [default1]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 459, in training_step [default1]: outputs = self.pipeline_engine.train_batch_iter( f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:frame #27: _PyEval_EvalFrameDefault + 0x2d83 (0x55a6b410b2b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:frame #28: _PyFunction_Vectorcall + 0x6c (0x55a6b4118a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:frame #29: _PyEval_EvalFrameDefault + 0x13ca (0x55a6b41098fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:frame #30: + 0x150582 (0x55a6b4124582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:frame #31: _PyEval_EvalFrameDefault + 0x13ca (0x55a6b41098fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:frame #32: + 0x150582 (0x55a6b4124582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:frame #33: _PyEval_EvalFrameDefault + 0x13ca (0x55a6b41098fa in /fsx/ferdinandmom/miniforge3[default1]: outputs, loss_avg = self.training_step(dataloader=self.current_dataloader) [default1]: File 
"/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 459, in training_step [default1]: output = self.forward(context=context, state=state, micro_batch=micro_batch, model=model) [default4]:frame #13: + 0x5843330 (0x7fc74fd65330 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default4]:frame #14: + 0x58433c5 (0x7fc74fd653c5 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default4]:frame #15: + 0x4e893cc (0x7fc74f3ab3cc in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default4]:frame #16: + 0x1a08a88 (0x7fc74bf2aa88 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default4]:frame #17: + 0x5849a84 (0x7fc74fd6ba84 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default1]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1706, in recv [default1]: pg.recv([tensor], group_src_rank, tag).wait() [default1]:torch.distributed.DistBackendError: [4] is setting up NCCL communicator and retrieving ncclUniqueId from [0] via c10d key-value store by key '3:4', but store->get('3:4') got error: Connection reset by peer [default1]:Exception raised from recvBytes at ../torch/csrc/distributed/c10d/Utils.hpp:670 (most recent call first): [default1]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7fe0091c4d87 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default1]:frame #1: + 0x589518e (0x7fe04117e18e in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default1]:frame #2: c10d::TCPStore::doWait(c10::ArrayRef, 
std::chrono::durat[default3]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl [default3]: return forward_call(*args, **kwargs) [default2]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 890, in forward [default2]: sharded_logits = self.model( [default1]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 187, in train_batch_iter /envs/env-bench-cluster/bin/python3.10) [default4]:frame #34: + 0x150582 (0x55a6b4124582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:frame #35: _PyEval_EvalFrameDefault + 0x13ca (0x55a6b41098fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:frame #36: _PyObject_FastCallDictTstate + 0xd0 (0x55a6b4110f50 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:frame #37: _PyObject_Call_Prepend + 0x69 (0x55a6b4122c39 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:frame #38: + 0x211239 (0x55a6b41e5239 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:frame #39: _PyObject_MakeTpCall + 0x26b (0x55a6b4111a6b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:frame #40: _PyEval_EvalFrameDefault + 0x4eb6 (0x55a6b410d3e6 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10[default6]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 237, in [default0]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl [default4]:frame #18: + 0x584ed35 (0x7fc74fd70d35 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default4]:frame #19: + 0xc97eee (0x7fc762622eee in 
/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_python.so) [default4]:frame #20: + 0x413ea4 (0x7fc761d9eea4 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_python.so) ion >) + 0x360 (0x7fe0411789a0 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default1]:frame #3: c10d::TCPStore::doGet(std::string const&) + 0x32 (0x7fe041178ce2 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default1]:frame #4: c10d::TCPStore::get(std::string const&) + 0xa1 (0x7fe041179b11 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default1]:frame #5: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7fe04112ef81 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default1]:frame #6: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7fe04112ef81 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default1]:frame #7: c10d::PrefixStore::get(std::string const&) + [default1]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 780, in forward_with_hidden_states [default2]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl [default2]: return self._call_impl(*args, **kwargs) [default2]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl [default1]: output = model(**micro_batch) [default2]: return forward_call(*args, **kwargs) [default1]: output = self.forward(context=context, state=state, micro_batch=micro_batch, model=model) ) [default4]:frame #41: 
_PyFunction_Vectorcall + 0x6c (0x55a6b4118a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:frame #42: _PyEval_EvalFrameDefault + 0x72c (0x55a6b4108c5c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:frame #43: _PyFunction_Vectorcall + 0x6c (0x55a6b4118a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:frame #44: _PyEval_EvalFrameDefault + 0x13ca (0x55a6b41098fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:frame #45: + 0x150582 (0x55a6b4124582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:frame #46: PyObject_Call + 0xbc (0x55a6b4124f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]: trainer.train(dataloader) [default0]: return self._call_impl(*args, **kwargs) [default4]:frame #21: + 0x1445a6 (0x5611bc0d85a6 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:frame #22: _PyObject_MakeTpCall + 0x26b (0x5611bc0d1a6b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:frame #23: + 0x150866 (0x5611bc0e4866 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:frame #24: _PyEval_EvalFrameDefault + 0x4c12 (0x5611bc0cd142 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:frame #25: _PyFunction_Vectorcall + 0x6c (0x5611bc0d8a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:frame #26: PyObject_Call + 0xbc (0x5611bc0e4f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:frame #27: _PyEval_EvalFrameDefault + 0x2d83 (0x5611bc0cb2b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) 0x31 (0x7fe04112ef81 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default1]:frame #8: 
c10d::PrefixStore::get(std::string const&) + 0x31 (0x7fe04112ef81 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default1]:frame #9: c10d::ProcessGroupNCCL::broadcastUniqueNCCLID(ncclUniqueId*, bool, std::string const&, int) + 0xa9 (0x7fe00a36cc69 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default3]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 890, in forward [default3]: sharded_logits = self.model( [default1]: hidden_encoder_states = encoder_block(**hidden_encoder_states) [default1]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl [default1]: return self._call_impl(*args, **kwargs) [default1]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl [default1]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 44, in forward [default4]:frame #47: _PyEval_EvalFrameDefault + 0x2d83 (0x55a6b410b2b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:frame #48: + 0x150582 (0x55a6b4124582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:frame #49: PyObject_Call + 0xbc (0x55a6b4124f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:frame #50: _PyEval_EvalFrameDefault + 0x2d83 (0x55a6b410b2b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:frame #51: _PyFunction_Vectorcall + 0x6c (0x55a6b4118a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:frame #52: _PyObject_FastCallDictTstate + 0x187 (0x55a6b4111007 in 
/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:frame #53: _PyObject_Call_Prepend + 0x69 (0x55a6b4122c39 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:frame #54: + 0x[default1]: outputs = self.pipeline_engine.train_batch_iter( [default0]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl [default4]:frame #28: _PyFunction_Vectorcall + 0x6c (0x5611bc0d8a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:frame #29: _PyEval_EvalFrameDefault + 0x13ca (0x5611bc0c98fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:frame #30: + 0x150582 (0x5611bc0e4582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:frame #31: _PyEval_EvalFrameDefault + 0x13ca (0x5611bc0c98fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:frame #32: + 0x150582 (0x5611bc0e4582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:frame #33: _PyEval_EvalFrameDefault + 0x13ca (0x5611bc0c98fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:frame #10: c10d::ProcessGroupNCCL::getNCCLComm(std::string const&, std::vector > const&, c10d::OpType, int, bool) + 0x22b (0x7fe00a373c5b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default1]:frame #11: c10d::ProcessGroupNCCL::recv(std::vector >&, int, int) + 0x550 (0x7fe00a396b60 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default1]:frame #12: + 0x5838439 (0x7fe041121439 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default1]:frame #13: + 0x5843330 (0x7fe04112c330 in 
/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default1]:frame #14: + 0x58433c5 (0x7fe04112c3c5 in /fsx/ferdinandmom/miniforge3/envs/env-[default3]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl [default1]: return self._call_impl(*args, **kwargs) [default2]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 764, in forward [default1]: output = model(**micro_batch) 211239 (0x55a6b41e5239 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:frame #55: PyObject_Call + 0x207 (0x55a6b4125067 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 430, in train [default4]: return forward_call(*args, **kwargs) [default0]: return forward_call(*args, **kwargs) [default0]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 764, in forward [default4]:frame #34: + 0x150582 (0x5611bc0e4582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:frame #35: _PyEval_EvalFrameDefault + 0x13ca (0x5611bc0c98fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:frame #36: _PyObject_FastCallDictTstate + 0xd0 (0x5611bc0d0f50 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:frame #37: _PyObject_Call_Prepend + 0x69 (0x5611bc0e2c39 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:frame #38: + 0x211239 (0x5611bc1a5239 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:frame #39: _PyObject_MakeTpCall + 0x26b (0x5611bc0d1a6b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:frame #40: _PyEval_EvalFrameDefault + 
0x4eb6 (0x5611bc0cd3e6 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:frame #41: _PyFunction_Vectbench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default1]:frame #15: + 0x4e893cc (0x7fe0407723cc in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default3]: return self._call_impl(*args, **kwargs) [default3]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl [default3]: return forward_call(*args, **kwargs) [default2]: return self.forward_with_hidden_states(input_ids=input_ids, input_mask=input_mask)[0] [default2]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 780, in forward_with_hidden_states [default1]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl [default4]:frame #56: _PyEval_EvalFrameDefault + 0x2d83 (0x55a6b410b2b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:frame #57: + 0x150582 (0x55a6b4124582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:frame #58: _PyEval_EvalFrameDefault + 0x13ca (0x55a6b41098fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 187, in train_batch_iter [default0]: return self.forward_with_hidden_states(input_ids=input_ids, input_mask=input_mask)[0] [default0]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 782, in forward_with_hidden_states orcall + 0x6c (0x5611bc0d8a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:frame #42: _PyEval_EvalFrameDefault + 0x72c (0x5611bc0c8c5c in 
/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:frame #16: + 0x1a08a88 (0x7fe03d2f1a88 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default1]:frame #17: + 0x5849a84 (0x7fe041132a84 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default1]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl [default2]: hidden_encoder_states = encoder_block(**hidden_encoder_states) [default2]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl [default1]: return self._call_impl(*args, **kwargs) [default1]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl [default4]:frame #59: + 0x150582 (0x55a6b4124582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:frame #60: PyObject_Call + 0xbc (0x55a6b4124f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:frame #61: _PyEval_EvalFrameDefault + 0x2d83 (0x55a6b410b2b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]: output = self.forward(context=context, state=state, micro_batch=micro_batch, model=model) [default0]: hidden_states = self.final_layer_norm(input=hidden_encoder_states["hidden_states"])["hidden_states"] [default0]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl [default4]:frame #43: _PyFunction_Vectorcall + 0x6c (0x5611bc0d8a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:frame #44: _PyEval_EvalFrameDefault + 0x13ca (0x5611bc0c98fa in 
/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:frame #45: + 0x150582 (0x5611bc0e4582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:frame #46: PyObject_Call + 0xbc (0x5611bc0e4f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:frame #47: _PyEval_EvalFrameDefault + 0x2d83 (0x5611bc0cb2b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:frame #48: + 0x150582 (0x5611bc0e4582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:frame #49: PyObject_Call + 0xbc (0x5611bc0e4f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:frame #50: _PyEval_EvalFrameDefault + 0x2d83 (0x56[default1]:frame #18: + 0x584ed35 (0x7fe041137d35 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default1]:frame #19: + 0xc97eee (0x7fe0539e9eee in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_python.so) [default1]:frame #20: + 0x413ea4 (0x7fe053165ea4 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_python.so) [default3]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 764, in forward [default2]: return self._call_impl(*args, **kwargs) [default1]: return forward_call(*args, **kwargs) [default2]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 237, in [default2]: trainer.train(dataloader) [default4]:frame #62: + 0x150582 (0x55a6b4124582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:frame #63: PyObject_Call + 0xbc (0x55a6b4124f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:. This may indicate a possible application crash on rank 0 or a network set up issue. 
[default1]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 44, in forward [default1]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 44, in forward 11bc0cb2b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:frame #21: + 0x1445a6 (0x55811091a5a6 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:frame #22: _PyObject_MakeTpCall + 0x26b (0x558110913a6b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]: return forward_call(*args, **kwargs) [default1]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/block.py", line 126, in forward [default1]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl [default2]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 430, in train [default1]:Traceback (most recent call last): [default1]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 237, in [default1]: trainer.train(dataloader) [default1]: output = model(**micro_batch) [default1]: output = model(**micro_batch) [default0]: return self._call_impl(*args, **kwargs) [default0]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl [default4]:frame #51: _PyFunction_Vectorcall + 0x6c (0x5611bc0d8a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:frame #52: _PyObject_FastCallDictTstate + 0x187 (0x5611bc0d1007 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:frame #53: _PyObject_Call_Prepend + 0x69 (0x5611bc0e2c39 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) 
[default4]:frame #54: + 0x211239 (0x5611bc1a5239 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:frame #55: PyObject_Call + 0x207 (0x5611bc0e5067 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:frame #56: _PyEval_EvalFrameDefault + 0x2d83 (0x5611bc0cb2b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:frame #57: + 0x150582 (0x5611bc0e4582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:frame #58: _PyEval_EvalFrameDefault +[default1]:frame #23: + 0x150866 (0x558110926866 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:frame #24: _PyEval_EvalFrameDefault + 0x4c12 (0x55811090f142 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]: return self.forward_with_hidden_states(input_ids=input_ids, input_mask=input_mask)[0] [default2]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl [default2]: outputs, loss_avg = self.training_step(dataloader=self.current_dataloader) [default2]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 459, in training_step [default2]: outputs = self.pipeline_engine.train_batch_iter( [default2]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 187, in train_batch_iter [default2]: output = self.forward(context=context, state=state, micro_batch=micro_batch, model=model) [default1]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 430, in train [default1]: outputs, loss_avg = self.training_step(dataloader=self.current_dataloader) [default1]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl [default1]: File 
"/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl [default1]: return self._call_impl(*args, **kwargs) 0x13ca (0x5611bc0c98fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:frame #59: + 0x150582 (0x5611bc0e4582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:frame #25: _PyFunction_Vectorcall + 0x6c (0x55811091aa2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:frame #26: PyObject_Call + 0xbc (0x558110926f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:frame #27: _PyEval_EvalFrameDefault + 0x2d83 (0x55811090d2b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:frame #28: _PyFunction_Vectorcall + 0x6c (0x55811091aa2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]: new_kwargs[name] = recv_from_pipeline_state_buffer( [default2]: return forward_call(*args, **kwargs) [default2]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/block.py", line 126, in forward [default2]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 44, in forward [default1]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 459, in training_step [default1]: outputs = self.pipeline_engine.train_batch_iter( [default1]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 187, in train_batch_iter [default1]: output = self.forward(context=context, state=state, micro_batch=micro_batch, model=model) [default1]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 44, in forward [default1]: output = model(**micro_batch) 
[default1]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl [default1]: return self._call_impl(*args, **kwargs) [default1]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl [default1]: return self._call_impl(*args, **kwargs) [default0]: return forward_call(*args, **kwargs) [default4]:frame #60: PyObject_Call + 0xbc (0x5611bc0e4f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:frame #61: _PyEval_EvalFrameDefault + 0x2d83 (0x5611bc0cb2b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:frame #62: + 0x150582 (0x5611bc0e4582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:frame #63: PyObject_Call + 0xbc (0x5611bc0e4f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:. This may indicate a possible application crash on rank 0 or a network set up issue. 
[default1]:frame #29: _PyEval_EvalFrameDefault + 0x13ca (0x55811090b8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:frame #30: + 0x150582 (0x558110926582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/functional.py", line 117, in recv_from_pipeline_state_buffer [default3]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 780, in forward_with_hidden_states [default1]: pipeline_state.run_communication() [default2]: new_kwargs[name] = recv_from_pipeline_state_buffer( [default2]: output = model(**micro_batch) [default2]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl [default2]: return self._call_impl(*args, **kwargs) [default1]: return forward_call(*args, **kwargs) [default1]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 890, in forward [default1]: sharded_logits = self.model( [default1]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl [default1]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl [default6]:Traceback (most recent call last): [default5]:Traceback (most recent call last): [default5]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 237, in [default5]: trainer.train(dataloader) [default1]:frame #31: _PyEval_EvalFrameDefault + 0x13ca (0x55811090b8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:frame #32: + 0x150582 (0x558110926582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:frame #33: 
_PyEval_EvalFrameDefault + 0x13ca (0x55811090b8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 150, in run_communication [default1]: return forward_call(*args, **kwargs) [default1]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 890, in forward [default1]: sharded_logits = self.model( [default1]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl [default1]: return self._call_impl(*args, **kwargs) [default1]: return forward_call(*args, **kwargs) [default1]: return forward_call(*args, **kwargs) [default6]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 237, in [default5]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 430, in train [default5]: outputs, loss_avg = self.training_step(dataloader=self.current_dataloader) [default5]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 459, in training_step [default1]:frame #34: + 0x150582 (0x558110926582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:frame #35: _PyEval_EvalFrameDefault + 0x13ca (0x55811090b8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:frame #36: _PyObject_FastCallDictTstate + 0xd0 (0x558110912f50 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:frame #37: _PyObject_Call_Prepend + 0x69 (0x558110924c39 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:frame #38: + 0x211239 (0x5581109e7239 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:frame #39: _PyObject_MakeTpCall + 0x26b (0x558110913a6b in 
/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:frame #40: _PyEval_EvalFrameDefault + 0x4eb6 (0x55811090f3e6 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:frame #41: _PyFunction_Vect[default3]: hidden_encoder_states = encoder_block(**hidden_encoder_states) [default2]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/functional.py", line 117, in recv_from_pipeline_state_buffer [default2]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl [default1]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl [default1]: return forward_call(*args, **kwargs) [default1]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 764, in forward [default1]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 890, in forward [default1]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 890, in forward [default1]: sharded_logits = self.model( [default5]: outputs = self.pipeline_engine.train_batch_iter( [default6]: trainer.train(dataloader) [default6]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 430, in train [default6]: outputs, loss_avg = self.training_step(dataloader=self.current_dataloader) [default6]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 459, in training_step orcall + 0x6c (0x55811091aa2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:frame #42: _PyEval_EvalFrameDefault + 0x72c (0x55811090ac5c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:frame #43: _PyFunction_Vectorcall + 0x6c 
(0x55811091aa2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:frame #44: _PyEval_EvalFrameDefault + 0x13ca (0x55811090b8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:frame #45: + 0x150582 (0x558110926582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:frame #46: PyObject_Call + 0xbc (0x558110926f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl [default1]: recv_activation_tensor = recv_activation() [default1]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 31, in __call__ [default1]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 890, in forward [default2]: return forward_call(*args, **kwargs) [default1]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl [default1]: return self.forward_with_hidden_states(input_ids=input_ids, input_mask=input_mask)[0] [default1]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 780, in forward_with_hidden_states [default1]: hidden_encoder_states = encoder_block(**hidden_encoder_states) [default1]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl [default1]: sharded_logits = self.model( [default0]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/block.py", line 126, in forward [default6]: outputs = self.pipeline_engine.train_batch_iter( [default1]:frame #47: _PyEval_EvalFrameDefault + 0x2d83 (0x55811090d2b3 in 
/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:frame #48: + 0x150582 (0x558110926582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:frame #49: PyObject_Call + 0xbc (0x558110926f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:frame #50: _PyEval_EvalFrameDefault + 0x2d83 (0x55811090d2b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:frame #51: _PyFunction_Vectorcall + 0x6c (0x55811091aa2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:frame #52: _PyObject_FastCallDictTstate + 0x187 (0x558110913007 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:frame #53: _PyObject_Call_Prepend + 0x69 (0x558110924c39 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:frame #54: + 0x[default1]: return self.p2p.recv_tensors(num_tensors=1, from_rank=self.from_rank)[0] [default2]: pipeline_state.run_communication() [default1]: return self._call_impl(*args, **kwargs) [default1]: return self._call_impl(*args, **kwargs) [default1]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl [default1]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl [default0]: new_kwargs[name] = recv_from_pipeline_state_buffer( [default1]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl [default5]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 187, in train_batch_iter [default5]: output = self.forward(context=context, state=state, micro_batch=micro_batch, model=model) [default5]: File 
"/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 44, in forward 211239 (0x5581109e7239 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:frame #55: PyObject_Call + 0x207 (0x558110927067 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:frame #56: _PyEval_EvalFrameDefault + 0x2d83 (0x55811090d2b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:frame #57: + 0x150582 (0x558110926582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:frame #58: _PyEval_EvalFrameDefault + 0x13ca (0x55811090b8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:frame #59: + 0x150582 (0x558110926582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:frame #60: PyObject_Call + 0xbc (0x558110926f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:frame #61: _PyEval_EvalFrameDefault + 0x2d83 (0x55811090d2b3 in /fsx/ferdinandmom/miniforge3/e[default3]: return self._call_impl(*args, **kwargs) [default3]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl [default1]: sharded_logits = self.model( [default1]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl [default1]: return forward_call(*args, **kwargs) [default1]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/block.py", line 126, in forward [default1]: new_kwargs[name] = recv_from_pipeline_state_buffer( [default1]: return self._call_impl(*args, **kwargs) [default1]: return self._call_impl(*args, **kwargs) [default6]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", 
line 187, in train_batch_iter nvs/env-bench-cluster/bin/python3.10) [default1]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 353, in recv_tensors [default2]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 150, in run_communication [default2]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 890, in forward [default2]: sharded_logits = self.model( [default1]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/functional.py", line 117, in recv_from_pipeline_state_buffer [default1]: pipeline_state.run_communication() [default1]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl [default1]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl [default5]: output = model(**micro_batch) [default6]: output = self.forward(context=context, state=state, micro_batch=micro_batch, model=model) [default1]:frame #62: + 0x150582 (0x558110926582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:frame #63: PyObject_Call + 0xbc (0x558110926f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:. This may indicate a possible application crash on rank 0 or a network set up issue. 
[default1]: buffers, futures = self.irecv_tensors(num_tensors=num_tensors, from_rank=from_rank, tag=tag) [default1]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl [default1]: return forward_call(*args, **kwargs) [default1]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 150, in run_communication [default1]: recv_activation_tensor = recv_activation() [default1]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 31, in __call__ [default1]: return self.p2p.recv_tensors(num_tensors=1, from_rank=self.from_rank)[0] [default1]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 353, in recv_tensors [default1]: return forward_call(*args, **kwargs) [default0]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/functional.py", line 117, in recv_from_pipeline_state_buffer [default5]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl [default5]: return self._call_impl(*args, **kwargs) [default3]:Traceback (most recent call last): [default1]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 326, in irecv_tensors [default1]: meta = self._recv_meta(from_rank=from_rank, tag=tag) [default2]: recv_activation_tensor = recv_activation() [default2]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl [default2]: return self._call_impl(*args, **kwargs) [default1]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 764, in forward [default1]: 
buffers, futures = self.irecv_tensors(num_tensors=num_tensors, from_rank=from_rank, tag=tag) [default1]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 326, in irecv_tensors [default6]: outputs, loss_avg = self.training_step(dataloader=self.current_dataloader) [default1]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 764, in forward [default0]: pipeline_state.run_communication() [default5]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl [default6]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 44, in forward [default6]: output = model(**micro_batch) [default3]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 237, in [default4]:Traceback (most recent call last): [default4]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 237, in [default1]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 246, in _recv_meta [default3]: return forward_call(*args, **kwargs) [default1]: return self._call_impl(*args, **kwargs) [default2]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl [default2]: return forward_call(*args, **kwargs) [default1]: meta = self._recv_meta(from_rank=from_rank, tag=tag) [default1]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 246, in _recv_meta [default1]: return self.forward_with_hidden_states(input_ids=input_ids, input_mask=input_mask)[0] [default4]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 764, in forward [default0]: File 
"/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 150, in run_communication [default0]: recv_activation_tensor = recv_activation() [default5]: return forward_call(*args, **kwargs) [default3]: trainer.train(dataloader) [default3]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 430, in train [default2]:Traceback (most recent call last): [default2]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 237, in [default3]: outputs, loss_avg = self.training_step(dataloader=self.current_dataloader) [default1]: dist.recv( [default1]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 72, in wrapper [default2]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 31, in __call__ [default2]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 764, in forward [default1]: dist.recv( [default1]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 72, in wrapper [default1]: return func(*args, **kwargs) [default1]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1706, in recv [default1]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 780, in forward_with_hidden_states [default1]: return forward_call(*args, **kwargs) [default5]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 890, in forward [default3]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 459, in training_step [default3]: File 
"/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/block.py", line 126, in forward [default1]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl [default2]: return self.forward_with_hidden_states(input_ids=input_ids, input_mask=input_mask)[0] [default1]: return self.forward_with_hidden_states(input_ids=input_ids, input_mask=input_mask)[0] [default1]: pg.recv([tensor], group_src_rank, tag).wait() [default1]:torch.distributed.DistBackendError: [10] is setting up NCCL communicator and retrieving ncclUniqueId from [0] via c10d key-value store by key '9:10', but store->get('9:10') got error: Connection reset by peer [default1]: hidden_encoder_states = encoder_block(**hidden_encoder_states) [default0]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 31, in __call__ [default5]: sharded_logits = self.model( [default5]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl [default4]: trainer.train(dataloader) [default1]: return func(*args, **kwargs) [default2]: return self.p2p.recv_tensors(num_tensors=1, from_rank=self.from_rank)[0] [default1]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 780, in forward_with_hidden_states [default1]:Exception raised from recvBytes at ../torch/csrc/distributed/c10d/Utils.hpp:670 (most recent call first): [default1]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f716bb24d87 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default3]:Traceback (most recent call last): [default3]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 237, in [default0]:Traceback (most recent call last): 
[default1]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 764, in forward [default5]: return self._call_impl(*args, **kwargs) [default5]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl [default4]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 430, in train [default3]: new_kwargs[name] = recv_from_pipeline_state_buffer( [default2]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 353, in recv_tensors [default1]: return forward_call(*args, **kwargs) [default2]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 780, in forward_with_hidden_states [default1]:frame #1: + 0x589518e (0x7f71a3ade18e in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default1]:frame #2: c10d::TCPStore::doWait(c10::ArrayRef, std::chrono::duration >) + 0x360 (0x7f71a3ad89a0 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default0]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 237, in [default4]: return self.forward_with_hidden_states(input_ids=input_ids, input_mask=input_mask)[0] [default0]: return self.p2p.recv_tensors(num_tensors=1, from_rank=self.from_rank)[0] [default5]: return forward_call(*args, **kwargs) [default4]: outputs, loss_avg = self.training_step(dataloader=self.current_dataloader) [default1]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1706, in recv [default2]: buffers, futures = self.irecv_tensors(num_tensors=num_tensors, from_rank=from_rank, tag=tag) [default2]: File 
"/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 326, in irecv_tensors [default1]: hidden_encoder_states = encoder_block(**hidden_encoder_states) [default1]:frame #3: c10d::TCPStore::doGet(std::string const&) + 0x32 (0x7f71a3ad8ce2 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default1]:frame #4: c10d::TCPStore::get(std::string const&) + 0xa1 (0x7f71a3ad9b11 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default1]:frame #5: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f71a3a8ef81 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default1]:frame #6: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f71a3a8ef81 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default3]: trainer.train(dataloader) [default1]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl [default1]: return self.forward_with_hidden_states(input_ids=input_ids, input_mask=input_mask)[0] [default1]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 782, in forward_with_hidden_states [default0]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 353, in recv_tensors [default5]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 764, in forward [default5]: return self.forward_with_hidden_states(input_ids=input_ids, input_mask=input_mask)[0] [default4]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 459, in training_step [default3]: outputs = self.pipeline_engine.train_batch_iter( 
[default2]: trainer.train(dataloader) [default1]: pg.recv([tensor], group_src_rank, tag).wait() [default3]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/functional.py", line 117, in recv_from_pipeline_state_buffer [default1]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 764, in forward [default1]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl [default1]: return self._call_impl(*args, **kwargs) [default1]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl [default1]: return forward_call(*args, **kwargs) [default1]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/block.py", line 126, in forward [default2]: hidden_encoder_states = encoder_block(**hidden_encoder_states) [default1]:frame #7: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f71a3a8ef81 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default1]:frame #8: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f71a3a8ef81 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default1]:frame #9: c10d::ProcessGroupNCCL::broadcastUniqueNCCLID(ncclUniqueId*, bool, std::string const&, int) + 0xa9 (0x7f716ccccc69 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default1]:frame #10: c10d::ProcessGroupNCCL::getNCCLComm(std::string const&, std::vector > const&, c10d::OpType, int, bool) + 0x22b (0x7f716ccd3c5b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default1]:frame #11: c10d::ProcessGroupNCCL::recv(std::vector >&, 
int, int) + 0x550 (0x7f716ccf6b60 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default1]:frame #12: + 0x5838439 (0x7f71a3a81439 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default1]:frame #13: + 0x5843330 (0x7f71a3a8c330 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default1]:frame #14: + 0x58433c5 (0x7f71a3a8c3c5 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default1]:frame #15: + 0x4e893cc (0x7f71a30d23cc in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default1]:frame #16: + 0x1a08a88 (0x7f719fc51a88 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/py[default3]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 430, in train [default0]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 430, in train [default0]: buffers, futures = self.irecv_tensors(num_tensors=num_tensors, from_rank=from_rank, tag=tag) [default0]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 326, in irecv_tensors [default1]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl [default5]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl [default5]: return self._call_impl(*args, **kwargs) [default3]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 187, in train_batch_iter [default1]:torch.distributed.DistBackendError: [12] is setting up NCCL communicator 
and retrieving ncclUniqueId from [0] via c10d key-value store by key '11:12', but store->get('11:12') got error: Connection reset by peer [default1]: return self.forward_with_hidden_states(input_ids=input_ids, input_mask=input_mask)[0] [default1]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 780, in forward_with_hidden_states [default2]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 246, in _recv_meta [default1]: new_kwargs[name] = recv_from_pipeline_state_buffer( thon3.10/site-packages/torch/lib/libtorch_cpu.so) [default1]:frame #17: + 0x5849a84 (0x7f71a3a92a84 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default6]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 459, in training_step [default6]: outputs = self.pipeline_engine.train_batch_iter( [default1]: return self._call_impl(*args, **kwargs) [default0]: meta = self._recv_meta(from_rank=from_rank, tag=tag) [default5]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl [default5]: return forward_call(*args, **kwargs) [default3]: output = self.forward(context=context, state=state, micro_batch=micro_batch, model=model) [default2]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 430, in train [default2]: outputs, loss_avg = self.training_step(dataloader=self.current_dataloader) [default3]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 150, in run_communication [default1]: hidden_encoder_states = encoder_block(**hidden_encoder_states) [default1]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl 
[default2]: dist.recv( [default2]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl [default2]: return forward_call(*args, **kwargs) [default1]:frame #18: + 0x584ed35 (0x7f71a3a97d35 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default4]: hidden_encoder_states = encoder_block(**hidden_encoder_states) [default1]: return self._call_impl(*args, **kwargs) [default5]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/block.py", line 126, in forward [default5]: new_kwargs[name] = recv_from_pipeline_state_buffer( [default5]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/functional.py", line 117, in recv_from_pipeline_state_buffer [default6]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl [default6]: return self._call_impl(*args, **kwargs) [default4]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 187, in train_batch_iter [default3]: recv_activation_tensor = recv_activation() [default2]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 72, in wrapper [default1]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/functional.py", line 117, in recv_from_pipeline_state_buffer [default1]: pipeline_state.run_communication() [default1]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 150, in run_communication [default1]:frame #19: + 0xc97eee (0x7f71b6349eee in 
/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_python.so) [default1]:frame #20: + 0x413ea4 (0x7f71b5ac5ea4 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_python.so) [default1]:frame #21: + 0x1445a6 (0x561717cab5a6 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 187, in train_batch_iter [default0]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 246, in _recv_meta [default6]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl [default5]: pipeline_state.run_communication() [default3]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 44, in forward [default5]:Traceback (most recent call last): [default2]: return func(*args, **kwargs) [default1]: recv_activation_tensor = recv_activation() [default1]:frame #22: _PyObject_MakeTpCall + 0x26b (0x561717ca4a6b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:frame #23: + 0x150866 (0x561717cb7866 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]: outputs, loss_avg = self.training_step(dataloader=self.current_dataloader) [default4]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl [default1]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl [default1]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl 
[default1]: return forward_call(*args, **kwargs) [default6]: return forward_call(*args, **kwargs) [default6]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 890, in forward [default4]: output = self.forward(context=context, state=state, micro_batch=micro_batch, model=model) [default5]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 237, in [default3]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 31, in __call__ [default1]:Exception raised from recvBytes at ../torch/csrc/distributed/c10d/Utils.hpp:670 (most recent call first): [default1]: return self._call_impl(*args, **kwargs) [default2]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/block.py", line 126, in forward [default1]:frame #24: _PyEval_EvalFrameDefault + 0x4c12 (0x561717ca0142 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:frame #25: _PyFunction_Vectorcall + 0x6c (0x561717caba2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:frame #26: PyObject_Call + 0xbc (0x561717cb7f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:frame #27: _PyEval_EvalFrameDefault + 0x2d83 (0x561717c9e2b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:frame #28: _PyFunction_Vectorcall + 0x6c (0x561717caba2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]: outputs, loss_avg = self.training_step(dataloader=self.current_dataloader) [default3]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 459, in training_step [default0]: dist.recv( [default0]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 72, in wrapper [default6]: sharded_logits = 
self.model( [default2]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 459, in training_step [default2]: outputs = self.pipeline_engine.train_batch_iter( [default5]: trainer.train(dataloader) [default1]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl [default2]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1706, in recv [default2]: new_kwargs[name] = recv_from_pipeline_state_buffer( [default1]:frame #29: _PyEval_EvalFrameDefault + 0x13ca (0x561717c9c8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:frame #30: + 0x150582 (0x561717cb7582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]: output = self.forward(context=context, state=state, micro_batch=micro_batch, model=model) [default0]: return func(*args, **kwargs) [default6]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl [default3]: output = model(**micro_batch) [default1]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f7f43249d87 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default3]: return self.p2p.recv_tensors(num_tensors=1, from_rank=self.from_rank)[0] [default5]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 430, in train [default1]: return forward_call(*args, **kwargs) [default2]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/functional.py", line 117, in recv_from_pipeline_state_buffer [default1]:frame #31: _PyEval_EvalFrameDefault + 0x13ca (0x561717c9c8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) 
[default1]:frame #32: + 0x150582 (0x561717cb7582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:frame #33: _PyEval_EvalFrameDefault + 0x13ca (0x561717c9c8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 44, in forward [default1]: return forward_call(*args, **kwargs) [default1]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/block.py", line 126, in forward [default3]: outputs = self.pipeline_engine.train_batch_iter( [default1]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/block.py", line 126, in forward [default6]: return self._call_impl(*args, **kwargs) [default4]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 44, in forward [default2]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 187, in train_batch_iter [default1]:frame #1: + 0x589518e (0x7f7f7b20318e in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default1]:frame #2: c10d::TCPStore::doWait(c10::ArrayRef, std::chrono::duration >) + 0x360 (0x7f7f7b1fd9a0 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default1]:frame #3: c10d::TCPStore::doGet(std::string const&) + 0x32 (0x7f7f7b1fdce2 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default1]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/block.py", line 126, in forward [default2]: pipeline_state.run_communication() [default2]: File 
"/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 150, in run_communication [default2]: recv_activation_tensor = recv_activation() [default2]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 31, in __call__ [default1]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 31, in __call__ [default1]:frame #34: + 0x150582 (0x561717cb7582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:frame #35: _PyEval_EvalFrameDefault + 0x13ca (0x561717c9c8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:frame #36: _PyObject_FastCallDictTstate + 0xd0 (0x561717ca3f50 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:frame #37: _PyObject_Call_Prepend + 0x69 (0x561717cb5c39 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:frame #38: + 0x211239 (0x561717d78239 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:frame #39: _PyObject_MakeTpCall + 0x26b (0x561717ca4a6b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:frame #40: _PyEval_EvalFrameDefault + 0x4eb6 (0x561717ca03e6 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 187, in train_batch_iter [default0]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1706, in recv [default5]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 150, in run_communication [default5]: recv_activation_tensor = recv_activation() [default3]: File 
"/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl [default3]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 353, in recv_tensors [default3]: buffers, futures = self.irecv_tensors(num_tensors=num_tensors, from_rank=from_rank, tag=tag) [default2]: pg.recv([tensor], group_src_rank, tag).wait() [default2]: return self.p2p.recv_tensors(num_tensors=1, from_rank=self.from_rank)[0] [default1]:frame #41: _PyFunction_Vectorcall + 0x6c (0x561717caba2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:frame #42: _PyEval_EvalFrameDefault + 0x72c (0x561717c9bc5c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]: new_kwargs[name] = recv_from_pipeline_state_buffer( [default1]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/functional.py", line 117, in recv_from_pipeline_state_buffer [default0]: pg.recv([tensor], group_src_rank, tag).wait() [default1]: new_kwargs[name] = recv_from_pipeline_state_buffer( [default1]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/functional.py", line 117, in recv_from_pipeline_state_buffer [default5]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 31, in __call__ [default5]: return self.p2p.recv_tensors(num_tensors=1, from_rank=self.from_rank)[0] [default3]: return self._call_impl(*args, **kwargs) [default5]: outputs, loss_avg = self.training_step(dataloader=self.current_dataloader) [default2]:torch.distributed.DistBackendError: [8] is setting up NCCL communicator and retrieving ncclUniqueId from [0] via c10d key-value store by key '7:8', but store->get('7:8') got error: Connection reset by peer [default1]: new_kwargs[name] = 
recv_from_pipeline_state_buffer( [default2]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 353, in recv_tensors [default1]: return self.p2p.recv_tensors(num_tensors=1, from_rank=self.from_rank)[0] [default1]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 353, in recv_tensors [default1]:frame #43: _PyFunction_Vectorcall + 0x6c (0x561717caba2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:frame #44: _PyEval_EvalFrameDefault + 0x13ca (0x561717c9c8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:frame #45: + 0x150582 (0x561717cb7582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]: output = self.forward(context=context, state=state, micro_batch=micro_batch, model=model) [default0]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 459, in training_step [default1]: pipeline_state.run_communication() [default5]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 353, in recv_tensors [default2]: output = self.forward(context=context, state=state, micro_batch=micro_batch, model=model) [default1]:frame #4: c10d::TCPStore::get(std::string const&) + 0xa1 (0x7f7f7b1feb11 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default1]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/functional.py", line 117, in recv_from_pipeline_state_buffer [default1]: pipeline_state.run_communication() [default2]: buffers, futures = self.irecv_tensors(num_tensors=num_tensors, from_rank=from_rank, tag=tag) [default1]:frame #46: PyObject_Call + 0xbc (0x561717cb7f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) 
[default1]:frame #47: _PyEval_EvalFrameDefault + 0x2d83 (0x561717c9e2b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:frame #48: + 0x150582 (0x561717cb7582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]: pipeline_state.run_communication() [default1]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 150, in run_communication [default5]: buffers, futures = self.irecv_tensors(num_tensors=num_tensors, from_rank=from_rank, tag=tag) [default2]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 44, in forward [default1]:frame #5: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f7f7b1b3f81 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default2]:Exception raised from recvBytes at ../torch/csrc/distributed/c10d/Utils.hpp:670 (most recent call first): [default2]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 326, in irecv_tensors [default1]:frame #49: PyObject_Call + 0xbc (0x561717cb7f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:frame #50: _PyEval_EvalFrameDefault + 0x2d83 (0x561717c9e2b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:frame #51: _PyFunction_Vectorcall + 0x6c (0x561717caba2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:frame #52: _PyObject_FastCallDictTstate + 0x187 (0x561717ca4007 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 44, in forward [default0]: outputs = self.pipeline_engine.train_batch_iter( [default1]: recv_activation_tensor = recv_activation() 
[default5]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 326, in irecv_tensors [default5]: meta = self._recv_meta(from_rank=from_rank, tag=tag) [default5]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 246, in _recv_meta [default2]: output = model(**micro_batch) [default3]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 326, in irecv_tensors [default1]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 150, in run_communication [default1]: buffers, futures = self.irecv_tensors(num_tensors=num_tensors, from_rank=from_rank, tag=tag) [default1]:frame #53: _PyObject_Call_Prepend + 0x69 (0x561717cb5c39 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:frame #54: + 0x211239 (0x561717d78239 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:frame #55: PyObject_Call + 0x207 (0x561717cb8067 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]: output = model(**micro_batch) [default6]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl [default1]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 31, in __call__ [default5]: dist.recv( [default2]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl [default3]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl [default1]:frame #6: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f7f7b1b3f81 in 
/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default1]: recv_activation_tensor = recv_activation() [default1]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 31, in __call__ [default1]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 326, in irecv_tensors [default2]: meta = self._recv_meta(from_rank=from_rank, tag=tag) [default2]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 246, in _recv_meta [default1]:frame #56: _PyEval_EvalFrameDefault + 0x2d83 (0x561717c9e2b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:frame #57: + 0x150582 (0x561717cb7582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:frame #58: _PyEval_EvalFrameDefault + 0x13ca (0x561717c9c8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 187, in train_batch_iter [default1]: return self.p2p.recv_tensors(num_tensors=1, from_rank=self.from_rank)[0] [default0]:torch.distributed.DistBackendError: [14] is setting up NCCL communicator and retrieving ncclUniqueId from [0] via c10d key-value store by key '13:14', but store->get('13:14') got error: Connection reset by peer [default0]:Exception raised from recvBytes at ../torch/csrc/distributed/c10d/Utils.hpp:670 (most recent call first): [default5]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 72, in wrapper [default3]: return forward_call(*args, **kwargs) [default0]:Traceback (most recent call last): [default3]: meta = self._recv_meta(from_rank=from_rank, tag=tag) [default1]: return 
self.p2p.recv_tensors(num_tensors=1, from_rank=self.from_rank)[0] [default2]: dist.recv( [default2]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 72, in wrapper [default1]:frame #59: + 0x150582 (0x561717cb7582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:frame #60: PyObject_Call + 0xbc (0x561717cb7f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:frame #61: _PyEval_EvalFrameDefault + 0x2d83 (0x561717c9e2b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:frame #62: + 0x150582 (0x561717cb7582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:frame #63: PyObject_Call + 0xbc (0x561717cb7f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:. This may indicate a possible application crash on rank 0 or a network set up issue. [default4]: return self._call_impl(*args, **kwargs) [default6]: return self._call_impl(*args, **kwargs) [default0]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7fd3f1073d87 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default5]: return func(*args, **kwargs) [default5]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1706, in recv [default0]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 237, in [default3]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 246, in _recv_meta [default1]:frame #7: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f7f7b1b3f81 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default5]: File 
"/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 459, in training_step [default5]: outputs = self.pipeline_engine.train_batch_iter( [default2]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f9da8cc1d87 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default2]: return func(*args, **kwargs) [default2]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1706, in recv [default2]: pg.recv([tensor], group_src_rank, tag).wait() [default2]:Traceback (most recent call last): [default0]: output = self.forward(context=context, state=state, micro_batch=micro_batch, model=model) [default0]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 44, in forward [default0]:frame #1: + 0x589518e (0x7fd42902d18e in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default5]: pg.recv([tensor], group_src_rank, tag).wait() [default0]: trainer.train(dataloader) [default1]:frame #8: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f7f7b1b3f81 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default1]:frame #9: c10d::ProcessGroupNCCL::broadcastUniqueNCCLID(ncclUniqueId*, bool, std::string const&, int) + 0xa9 (0x7f7f443f1c69 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default1]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 353, in recv_tensors [default2]:torch.distributed.DistBackendError: [2] is setting up NCCL communicator and retrieving ncclUniqueId from [0] via c10d key-value store by key '1:2', but store->get('1:2') got error: Connection reset by peer 
[default3]:Traceback (most recent call last): [default5]:Traceback (most recent call last): [default0]:Traceback (most recent call last): [default2]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 237, in [default2]: trainer.train(dataloader) [default1]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 150, in run_communication [default1]: recv_activation_tensor = recv_activation() [default0]:frame #2: c10d::TCPStore::doWait(c10::ArrayRef, std::chrono::duration >) + 0x360 (0x7fd4290279a0 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default5]:torch.distributed.DistBackendError: [1] is setting up NCCL communicator and retrieving ncclUniqueId from [0] via c10d key-value store by key '0:1', but store->get('0:1') got error: Connection reset by peer [default5]:Exception raised from recvBytes at ../torch/csrc/distributed/c10d/Utils.hpp:670 (most recent call first): [default4]: output = model(**micro_batch) [default5]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 187, in train_batch_iter [default2]:frame #1: + 0x589518e (0x7f9de0c7b18e in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default1]: meta = self._recv_meta(from_rank=from_rank, tag=tag) [default0]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 237, in [default0]: output = model(**micro_batch) [default0]:frame #3: c10d::TCPStore::doGet(std::string const&) + 0x32 (0x7fd429027ce2 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default6]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl [default2]: return self._call_impl(*args, 
**kwargs) [default1]:frame #10: c10d::ProcessGroupNCCL::getNCCLComm(std::string const&, std::vector > const&, c10d::OpType, int, bool) + 0x22b (0x7f7f443f8c5b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default1]: buffers, futures = self.irecv_tensors(num_tensors=num_tensors, from_rank=from_rank, tag=tag) [default2]:frame #2: c10d::TCPStore::doWait(c10::ArrayRef, std::chrono::duration >) + 0x360 (0x7f9de0c759a0 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default2]:Exception raised from recvBytes at ../torch/csrc/distributed/c10d/Utils.hpp:670 (most recent call first): [default0]: trainer.train(dataloader) [default0]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 430, in train [default4]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl [default0]:frame #4: c10d::TCPStore::get(std::string const&) + 0xa1 (0x7fd429028b11 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default0]:frame #5: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7fd428fddf81 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default0]:frame #6: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7fd428fddf81 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default5]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7fd3f218bd87 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default3]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 890, in forward [default1]:frame #11: c10d::ProcessGroupNCCL::recv(std::vector 
>&, int, int) + 0x550 (0x7f7f4441bb60 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default1]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 326, in irecv_tensors [default1]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 246, in _recv_meta [default2]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 430, in train [default2]: outputs, loss_avg = self.training_step(dataloader=self.current_dataloader) [default4]: return forward_call(*args, **kwargs) [default1]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 353, in recv_tensors [default1]: buffers, futures = self.irecv_tensors(num_tensors=num_tensors, from_rank=from_rank, tag=tag) [default5]:frame #1: + 0x589518e (0x7fd42a14518e in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default6]: return forward_call(*args, **kwargs) [default6]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 764, in forward [default4]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl [default3]: dist.recv( [default2]:frame #3: c10d::TCPStore::doGet(std::string const&) + 0x32 (0x7f9de0c75ce2 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default2]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f391a387d87 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default2]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 459, in training_step [default0]: 
outputs, loss_avg = self.training_step(dataloader=self.current_dataloader) [default0]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 459, in training_step [default4]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/block.py", line 126, in forward [default0]:frame #7: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7fd428fddf81 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default6]: return self.forward_with_hidden_states(input_ids=input_ids, input_mask=input_mask)[0] [default4]: return self._call_impl(*args, **kwargs) [default3]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 72, in wrapper [default1]: meta = self._recv_meta(from_rank=from_rank, tag=tag) [default1]: dist.recv( [default1]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 72, in wrapper [default2]: outputs = self.pipeline_engine.train_batch_iter( [default2]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 187, in train_batch_iter [default4]: new_kwargs[name] = recv_from_pipeline_state_buffer( [default1]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 326, in irecv_tensors [default5]:frame #2: c10d::TCPStore::doWait(c10::ArrayRef, std::chrono::duration >) + 0x360 (0x7fd42a13f9a0 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default3]: sharded_logits = self.model( [default3]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl [default3]: return func(*args, **kwargs) [default2]:frame #4: 
c10d::TCPStore::get(std::string const&) + 0xa1 (0x7f9de0c76b11 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default1]: return func(*args, **kwargs) [default2]: output = self.forward(context=context, state=state, micro_batch=micro_batch, model=model) [default3]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 237, in [default0]: outputs = self.pipeline_engine.train_batch_iter( [default0]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 187, in train_batch_iter [default6]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl [default0]:frame #8: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7fd428fddf81 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default5]:frame #3: c10d::TCPStore::doGet(std::string const&) + 0x32 (0x7fd42a13fce2 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default2]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl [default2]: return forward_call(*args, **kwargs) [default3]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1706, in recv [default1]:frame #12: + 0x5838439 (0x7f7f7b1a6439 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default2]:frame #5: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f9de0c2bf81 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default1]: File 
"/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 246, in _recv_meta [default2]:frame #1: + 0x589518e (0x7f395234118e in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default0]: output = self.forward(context=context, state=state, micro_batch=micro_batch, model=model) [default1]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 31, in __call__ [default1]: meta = self._recv_meta(from_rank=from_rank, tag=tag) [default6]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 780, in forward_with_hidden_states [default3]: return self._call_impl(*args, **kwargs) [default3]: pg.recv([tensor], group_src_rank, tag).wait() [default1]:frame #13: + 0x5843330 (0x7f7f7b1b1330 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default1]:frame #14: + 0x58433c5 (0x7f7f7b1b13c5 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default1]:frame #15: + 0x4e893cc (0x7f7f7a7f73cc in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default2]:frame #6: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f9de0c2bf81 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default2]:frame #7: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f9de0c2bf81 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default1]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1706, in recv [default5]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 237, in 
[default1]: return self.p2p.recv_tensors(num_tensors=1, from_rank=self.from_rank)[0] [default3]: output = model(**micro_batch) [default0]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl [default0]:frame #9: c10d::ProcessGroupNCCL::broadcastUniqueNCCLID(ncclUniqueId*, bool, std::string const&, int) + 0xa9 (0x7fd3f221bc69 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default6]: hidden_encoder_states = encoder_block(**hidden_encoder_states) [default0]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 430, in train [default5]: output = self.forward(context=context, state=state, micro_batch=micro_batch, model=model) [default2]:frame #8: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f9de0c2bf81 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default1]: dist.recv( [default1]: pg.recv([tensor], group_src_rank, tag).wait() [default2]:frame #2: c10d::TCPStore::doWait(c10::ArrayRef, std::chrono::duration >) + 0x360 (0x7f395233b9a0 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default2]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 44, in forward [default1]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 353, in recv_tensors [default1]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 246, in _recv_meta [default5]:frame #4: c10d::TCPStore::get(std::string const&) + 0xa1 (0x7fd42a140b11 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default4]: File 
"/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl [default3]:torch.distributed.DistBackendError: [12] is setting up NCCL communicator and retrieving ncclUniqueId from [0] via c10d key-value store by key '11:12', but store->get('11:12') got error: Connection reset by peer [default1]:frame #16: + 0x1a08a88 (0x7f7f77376a88 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default1]:frame #17: + 0x5849a84 (0x7f7f7b1b7a84 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default2]:frame #9: c10d::ProcessGroupNCCL::broadcastUniqueNCCLID(ncclUniqueId*, bool, std::string const&, int) + 0xa9 (0x7f9da9e69c69 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default2]:frame #3: c10d::TCPStore::doGet(std::string const&) + 0x32 (0x7f395233bce2 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default0]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 44, in forward [default5]: trainer.train(dataloader) [default6]: return forward_call(*args, **kwargs) [default0]:frame #10: c10d::ProcessGroupNCCL::getNCCLComm(std::string const&, std::vector > const&, c10d::OpType, int, bool) + 0x22b (0x7fd3f2222c5b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default6]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl [default3]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl [default7]:Traceback (most recent call last): [default7]: File 
"/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 237, in [default0]: outputs, loss_avg = self.training_step(dataloader=self.current_dataloader) [default0]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 459, in training_step [default0]: outputs = self.pipeline_engine.train_batch_iter( [default3]:Exception raised from recvBytes at ../torch/csrc/distributed/c10d/Utils.hpp:670 (most recent call first): [default1]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 72, in wrapper [default2]:frame #4: c10d::TCPStore::get(std::string const&) + 0xa1 (0x7f395233cb11 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default2]:frame #5: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f39522f1f81 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default2]: output = model(**micro_batch) [default0]: return self._call_impl(*args, **kwargs) [default0]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl [default1]: dist.recv( [default6]: return self._call_impl(*args, **kwargs) [default6]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl [default3]: return forward_call(*args, **kwargs) [default3]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 764, in forward [default4]: return forward_call(*args, **kwargs) [default4]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 890, in forward [default7]: trainer.train(dataloader) [default3]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7ff4b9d33d87 in 
/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default2]:frame #10: c10d::ProcessGroupNCCL::getNCCLComm(std::string const&, std::vector > const&, c10d::OpType, int, bool) + 0x22b (0x7f9da9e70c5b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default2]:frame #11: c10d::ProcessGroupNCCL::recv(std::vector >&, int, int) + 0x550 (0x7f9da9e93b60 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default1]:torch.distributed.DistBackendError: [2] is setting up NCCL communicator and retrieving ncclUniqueId from [0] via c10d key-value store by key '1:2', but store->get('1:2') got error: Connection reset by peer [default3]: trainer.train(dataloader) [default3]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl [default3]: return self._call_impl(*args, **kwargs) [default0]:frame #11: c10d::ProcessGroupNCCL::recv(std::vector >&, int, int) + 0x550 (0x7fd3f2245b60 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default6]: return forward_call(*args, **kwargs) [default7]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 430, in train [default0]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 187, in train_batch_iter [default3]:frame #1: + 0x589518e (0x7ff4f1ced18e in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default2]:frame #12: + 0x5838439 (0x7f9de0c1e439 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default1]:Exception raised from recvBytes at ../torch/csrc/distributed/c10d/Utils.hpp:670 (most 
recent call first): [default0]: output = model(**micro_batch) [default0]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl [default0]: return self._call_impl(*args, **kwargs) [default1]: buffers, futures = self.irecv_tensors(num_tensors=num_tensors, from_rank=from_rank, tag=tag) [default0]:frame #12: + 0x5838439 (0x7fd428fd0439 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default0]:frame #13: + 0x5843330 (0x7fd428fdb330 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default6]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/block.py", line 126, in forward [default6]: new_kwargs[name] = recv_from_pipeline_state_buffer( [default0]: output = self.forward(context=context, state=state, micro_batch=micro_batch, model=model) [default3]: return self.forward_with_hidden_states(input_ids=input_ids, input_mask=input_mask)[0] [default3]:frame #2: c10d::TCPStore::doWait(c10::ArrayRef, std::chrono::duration >) + 0x360 (0x7ff4f1ce79a0 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default2]:frame #13: + 0x5843330 (0x7f9de0c29330 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default1]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f20f665ed87 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default0]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl [default2]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in 
_wrapped_call_impl [default5]:Traceback (most recent call last): [default5]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 237, in [default0]: return forward_call(*args, **kwargs) [default0]:frame #14: + 0x58433c5 (0x7fd428fdb3c5 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default0]:frame #15: + 0x4e893cc (0x7fd4286213cc in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default6]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/functional.py", line 117, in recv_from_pipeline_state_buffer [default7]: outputs, loss_avg = self.training_step(dataloader=self.current_dataloader) [default5]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 44, in forward [default3]:frame #3: c10d::TCPStore::doGet(std::string const&) + 0x32 (0x7ff4f1ce7ce2 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default2]:frame #14: + 0x58433c5 (0x7f9de0c293c5 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default2]:frame #15: + 0x4e893cc (0x7f9de026f3cc in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default1]:frame #1: + 0x589518e (0x7f212e61818e in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default5]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 430, in train [default4]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/functional.py", line 117, in recv_from_pipeline_state_buffer [default6]: File 
"/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 890, in forward [default0]:frame #16: + 0x1a08a88 (0x7fd4251a0a88 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default6]: pipeline_state.run_communication() [default6]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 150, in run_communication [default3]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 780, in forward_with_hidden_states [default4]: sharded_logits = self.model( [default5]: output = model(**micro_batch) [default1]:frame #18: + 0x584ed35 (0x7f7f7b1bcd35 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default2]:frame #16: + 0x1a08a88 (0x7f9ddcdeea88 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default1]: return func(*args, **kwargs) [default2]:frame #6: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f39522f1f81 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default3]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 430, in train [default5]: trainer.train(dataloader) [default1]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 72, in wrapper [default6]: recv_activation_tensor = recv_activation() [default0]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 44, in forward [default0]: output = model(**micro_batch) [default6]:Traceback (most recent call last): [default6]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 237, in [default2]:frame #17: + 0x5849a84 
(0x7f9de0c2fa84 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default1]:frame #2: c10d::TCPStore::doWait(c10::ArrayRef, std::chrono::duration >) + 0x360 (0x7f212e6129a0 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default2]: return self._call_impl(*args, **kwargs) [default4]: pipeline_state.run_communication() [default1]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 326, in irecv_tensors [default0]:frame #17: + 0x5849a84 (0x7fd428fe1a84 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default6]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 31, in __call__ [default6]: return self.p2p.recv_tensors(num_tensors=1, from_rank=self.from_rank)[0] [default7]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 459, in training_step [default7]: outputs = self.pipeline_engine.train_batch_iter( [default4]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl [default5]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl [default5]: return self._call_impl(*args, **kwargs) [default1]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1706, in recv [default2]:frame #7: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f39522f1f81 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default5]: outputs, loss_avg = self.training_step(dataloader=self.current_dataloader) [default1]: 
meta = self._recv_meta(from_rank=from_rank, tag=tag) [default0]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 890, in forward [default3]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl [default1]: return func(*args, **kwargs) [default6]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 353, in recv_tensors [default4]: return self._call_impl(*args, **kwargs) [default2]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 890, in forward [default1]:frame #19: + 0xc97eee (0x7f7f8da6eeee in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_python.so) [default1]:frame #20: + 0x413ea4 (0x7f7f8d1eaea4 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_python.so) [default1]: pg.recv([tensor], group_src_rank, tag).wait() [default1]:frame #3: c10d::TCPStore::doGet(std::string const&) + 0x32 (0x7f212e612ce2 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default0]: return forward_call(*args, **kwargs) [default6]: sharded_logits = self.model( [default1]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1706, in recv [default0]:frame #18: + 0x584ed35 (0x7fd428fe6d35 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default6]: buffers, futures = self.irecv_tensors(num_tensors=num_tensors, from_rank=from_rank, tag=tag) [default6]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 326, in irecv_tensors [default2]: sharded_logits = self.model( [default3]:frame #4: 
c10d::TCPStore::get(std::string const&) + 0xa1 (0x7ff4f1ce8b11 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default1]:torch.distributed.DistBackendError: [8] is setting up NCCL communicator and retrieving ncclUniqueId from [0] via c10d key-value store by key '7:8', but store->get('7:8') got error: Connection reset by peer [default2]:frame #8: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f39522f1f81 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default2]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl [default4]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 150, in run_communication [default1]: pg.recv([tensor], group_src_rank, tag).wait() [default6]: meta = self._recv_meta(from_rank=from_rank, tag=tag) [default4]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl [default4]: return forward_call(*args, **kwargs) [default3]:frame #5: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7ff4f1c9df81 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default2]:frame #18: + 0x584ed35 (0x7f9de0c34d35 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default1]:frame #4: c10d::TCPStore::get(std::string const&) + 0xa1 (0x7f212e613b11 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default3]: outputs, loss_avg = self.training_step(dataloader=self.current_dataloader) [default3]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 459, in training_step [default4]: 
recv_activation_tensor = recv_activation() [default1]:torch.distributed.DistBackendError: [14] is setting up NCCL communicator and retrieving ncclUniqueId from [0] via c10d key-value store by key '13:14', but store->get('13:14') got error: Connection reset by peer [default6]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 246, in _recv_meta [default6]: dist.recv( [default2]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl [default2]: return self._call_impl(*args, **kwargs) [default3]:frame #6: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7ff4f1c9df81 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default1]:Exception raised from recvBytes at ../torch/csrc/distributed/c10d/Utils.hpp:670 (most recent call first): [default2]:frame #9: c10d::ProcessGroupNCCL::broadcastUniqueNCCLID(ncclUniqueId*, bool, std::string const&, int) + 0xa9 (0x7f391b52fc69 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default0]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 890, in forward [default2]: return forward_call(*args, **kwargs) [default4]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 31, in __call__ [default1]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 246, in _recv_meta [default0]: sharded_logits = self.model( [default1]:Exception raised from recvBytes at ../torch/csrc/distributed/c10d/Utils.hpp:670 (most recent call first): [default1]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7fac0b655d87 in 
/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default6]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 72, in wrapper [default5]:frame #5: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7fd42a0f5f81 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default3]: hidden_encoder_states = encoder_block(**hidden_encoder_states) [default3]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl [default4]:Traceback (most recent call last): [default1]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7fddb2795d87 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default2]:frame #19: + 0xc97eee (0x7f9df34e6eee in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_python.so) [default1]:frame #5: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f212e5c8f81 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default3]: outputs = self.pipeline_engine.train_batch_iter( [default4]: return self.p2p.recv_tensors(num_tensors=1, from_rank=self.from_rank)[0] [default1]: dist.recv( [default1]:frame #1: + 0x589518e (0x7fac4360f18e in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default6]: return func(*args, **kwargs) [default6]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1706, in recv [default2]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl [default5]: File 
"/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl [default1]:frame #1: + 0x589518e (0x7fddea74f18e in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default2]:frame #10: c10d::ProcessGroupNCCL::getNCCLComm(std::string const&, std::vector > const&, c10d::OpType, int, bool) + 0x22b (0x7f391b536c5b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default2]:frame #11: c10d::ProcessGroupNCCL::recv(std::vector >&, int, int) + 0x550 (0x7f391b559b60 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default5]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 459, in training_step [default1]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 72, in wrapper [default0]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl [default4]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 353, in recv_tensors [default1]:frame #2: c10d::TCPStore::doWait(c10::ArrayRef, std::chrono::duration >) + 0x360 (0x7fac436099a0 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default1]:frame #3: c10d::TCPStore::doGet(std::string const&) + 0x32 (0x7fac43609ce2 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default6]: pg.recv([tensor], group_src_rank, tag).wait() [default0]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl 
[default0]:Traceback (most recent call last): [default1]:frame #2: c10d::TCPStore::doWait(c10::ArrayRef, std::chrono::duration >) + 0x360 (0x7fddea7499a0 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default1]:frame #6: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f212e5c8f81 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default2]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 890, in forward [default4]: buffers, futures = self.irecv_tensors(num_tensors=num_tensors, from_rank=from_rank, tag=tag) [default6]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl [default1]: return func(*args, **kwargs) [default1]:frame #4: c10d::TCPStore::get(std::string const&) + 0xa1 (0x7fac4360ab11 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default6]:torch.distributed.DistBackendError: [1] is setting up NCCL communicator and retrieving ncclUniqueId from [0] via c10d key-value store by key '0:1', but store->get('0:1') got error: Connection reset by peer [default3]: return self._call_impl(*args, **kwargs) [default3]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl [default0]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 237, in [default1]:frame #3: c10d::TCPStore::doGet(std::string const&) + 0x32 (0x7fddea749ce2 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default1]:frame #4: c10d::TCPStore::get(std::string const&) + 0xa1 (0x7fddea74ab11 in 
/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default1]:frame #7: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f212e5c8f81 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default3]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 187, in train_batch_iter [default3]: return forward_call(*args, **kwargs) [default0]: return self._call_impl(*args, **kwargs) [default1]:frame #5: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7fac435bff81 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default1]:frame #6: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7fac435bff81 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default5]:frame #6: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7fd42a0f5f81 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default4]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 764, in forward [default4]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 237, in [default6]: trainer.train(dataloader) [default1]:frame #5: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7fddea6fff81 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default1]:frame #8: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f212e5c8f81 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default1]:frame #9: c10d::ProcessGroupNCCL::broadcastUniqueNCCLID(ncclUniqueId*, bool, std::string const&, int) + 0xa9 (0x7f20f7806c69 in 
/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default1]:frame #10: c10d::ProcessGroupNCCL::getNCCLComm(std::string const&, std::vector > const&, c10d::OpType, int, bool) + 0x22b (0x7f20f780dc5b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default2]:frame #12: + 0x5838439 (0x7f39522e4439 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default5]: outputs = self.pipeline_engine.train_batch_iter( [default2]: sharded_logits = self.model( [default4]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 326, in irecv_tensors [default6]: return self._call_impl(*args, **kwargs) [default6]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl [default1]:frame #7: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7fac435bff81 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default1]:frame #8: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7fac435bff81 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default6]:Exception raised from recvBytes at ../torch/csrc/distributed/c10d/Utils.hpp:670 (most recent call first): [default2]: return forward_call(*args, **kwargs) [default2]:Traceback (most recent call last): [default1]:frame #6: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7fddea6fff81 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default2]:frame #20: + 0x413ea4 (0x7f9df2c62ea4 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_python.so) [default2]:frame #21: + 0x1445a6 
(0x55c235b4e5a6 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:frame #11: c10d::ProcessGroupNCCL::recv(std::vector >&, int, int) + 0x550 (0x7f20f7830b60 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default3]: output = self.forward(context=context, state=state, micro_batch=micro_batch, model=model) [default2]:Traceback (most recent call last): [default3]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 890, in forward [default0]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl [default1]:frame #9: c10d::ProcessGroupNCCL::broadcastUniqueNCCLID(ncclUniqueId*, bool, std::string const&, int) + 0xa9 (0x7fac0c7fdc69 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default6]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f2f73e15d87 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default3]: return forward_call(*args, **kwargs) [default3]:frame #7: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7ff4f1c9df81 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default1]:frame #7: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7fddea6fff81 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default2]:frame #22: _PyObject_MakeTpCall + 0x26b (0x55c235b47a6b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:frame #23: + 0x150866 (0x55c235b5a866 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:frame #13: + 0x5843330 (0x7f39522ef330 in 
/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default5]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 187, in train_batch_iter [default4]: meta = self._recv_meta(from_rank=from_rank, tag=tag) [default1]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1706, in recv [default2]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 237, in [default2]: trainer.train(dataloader) [default1]:frame #10: c10d::ProcessGroupNCCL::getNCCLComm(std::string const&, std::vector > const&, c10d::OpType, int, bool) + 0x22b (0x7fac0c804c5b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default5]:frame #7: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7fd42a0f5f81 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default0]: return self._call_impl(*args, **kwargs) [default0]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl [default5]: return forward_call(*args, **kwargs) [default1]:frame #8: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7fddea6fff81 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default1]:frame #12: + 0x5838439 (0x7f212e5bb439 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default0]: sharded_logits = self.model( [default5]: output = self.forward(context=context, state=state, micro_batch=micro_batch, model=model) [default3]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 44, in forward [default0]: 
return forward_call(*args, **kwargs) [default0]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 764, in forward [default1]:frame #11: c10d::ProcessGroupNCCL::recv(std::vector >&, int, int) + 0x550 (0x7fac0c827b60 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default1]:frame #12: + 0x5838439 (0x7fac435b2439 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default1]:frame #13: + 0x5843330 (0x7fac435bd330 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default1]:frame #14: + 0x58433c5 (0x7fac435bd3c5 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default0]:frame #19: + 0xc97eee (0x7fd43b898eee in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_python.so) [default1]:frame #15: + 0x4e893[default6]:frame #1: + 0x589518e (0x7f2fabdcf18e in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default7]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 187, in train_batch_iter [default7]: output = self.forward(context=context, state=state, micro_batch=micro_batch, model=model) [default6]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 430, in train [default2]:frame #24: _PyEval_EvalFrameDefault + 0x4c12 (0x55c235b43142 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:frame #14: + 0x58433c5 (0x7f39522ef3c5 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default0]: File 
"/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl [default5]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 430, in train [default6]: return forward_call(*args, **kwargs) [default6]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 764, in forward [default1]: pg.recv([tensor], group_src_rank, tag).wait() cc (0x7fac42c033cc in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default1]:frame #16: + 0x1a08a88 (0x7fac3f782a88 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default6]:frame #2: c10d::TCPStore::doWait(c10::ArrayRef, std::chrono::duration >) + 0x360 (0x7f2fabdc99a0 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default0]: return forward_call(*args, **kwargs) [default2]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 764, in forward [default2]: return self.forward_with_hidden_states(input_ids=input_ids, input_mask=input_mask)[0] [default6]: outputs, loss_avg = self.training_step(dataloader=self.current_dataloader) [default2]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 237, in [default3]:frame #8: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7ff4f1c9df81 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default3]:frame #9: c10d::ProcessGroupNCCL::broadcastUniqueNCCLID(ncclUniqueId*, bool, std::string const&, int) + 0xa9 (0x7ff4baedbc69 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default2]:frame #25: _PyFunction_Vectorcall + 0x6c (0x55c235b4ea2c in 
/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:frame #9: c10d::ProcessGroupNCCL::broadcastUniqueNCCLID(ncclUniqueId*, bool, std::string const&, int) + 0xa9 (0x7fddb393dc69 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default2]:frame #15: + 0x4e893cc (0x7f39519353cc in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default1]:frame #13: + 0x5843330 (0x7f212e5c6330 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default5]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 44, in forward [default2]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 430, in train [default2]: outputs, loss_avg = self.training_step(dataloader=self.current_dataloader) [default0]:frame #20: + 0x413ea4 (0x7fd43b014ea4 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_python.so) [default6]:frame #3: c10d::TCPStore::doGet(std::string const&) + 0x32 (0x7f2fabdc9ce2 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default5]:frame #8: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7fd42a0f5f81 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default7]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 44, in forward [default0]: trainer.train(dataloader) [default1]:frame #10: c10d::ProcessGroupNCCL::getNCCLComm(std::string const&, std::vector > const&, c10d::OpType, int, bool) + 0x22b (0x7fddb3944c5b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) 
[default2]:frame #16: + 0x1a08a88 (0x7f394e4b4a88 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default2]:frame #17: + 0x5849a84 (0x7f39522f5a84 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default0]: return self._call_impl(*args, **kwargs) [default0]: return self.forward_with_hidden_states(input_ids=input_ids, input_mask=input_mask)[0] [default0]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 780, in forward_with_hidden_states [default1]:frame #17: + 0x5849a84 (0x7fac435c3a84 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default6]:frame #4: c10d::TCPStore::get(std::string const&) + 0xa1 (0x7f2fabdcab11 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default6]:frame #5: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f2fabd7ff81 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default3]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/block.py", line 126, in forward [default6]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 459, in training_step [default2]: trainer.train(dataloader) [default1]:frame #11: c10d::ProcessGroupNCCL::recv(std::vector >&, int, int) + 0x550 (0x7fddb3967b60 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default1]:frame #14: + 0x58433c5 (0x7f212e5c63c5 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default1]:frame #15: + 0x4e893cc (0x7f212dc0c3cc in 
/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default2]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl [default1]:torch.distributed.DistBackendError: [6] is setting up NCCL communicator and retrieving ncclUniqueId from [0] via c10d key-value store by key '5:6', but store->get('5:6') got error: Connection reset by peer [default1]:frame #18: + 0x584ed35 (0x7fac435c8d35 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default1]:frame #19: + 0xc97eee (0x7fac55e7aeee in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_python.so) [default6]:frame #6: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f2fabd7ff81 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default6]:frame #7: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f2fabd7ff81 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default3]: new_kwargs[name] = recv_from_pipeline_state_buffer( [default2]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 780, in forward_with_hidden_states [default5]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 890, in forward [default1]:frame #12: + 0x5838439 (0x7fddea6f2439 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default1]:frame #16: + 0x1a08a88 (0x7f212a78ba88 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default1]:frame #17: + 0x5849a84 (0x7f212e5cca84 in 
/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default3]: output = model(**micro_batch) [default5]: outputs, loss_avg = self.training_step(dataloader=self.current_dataloader) [default4]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 246, in _recv_meta [default0]: hidden_encoder_states = encoder_block(**hidden_encoder_states) [default0]:frame #21: + 0x1445a6 (0x557fc753c5a6 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:frame #8: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f2fabd7ff81 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default2]: hidden_encoder_states = encoder_block(**hidden_encoder_states) [default7]: output = model(**micro_batch) [default4]: trainer.train(dataloader) [default4]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 430, in train [default1]:frame #13: + 0x5843330 (0x7fddea6fd330 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default1]:frame #14: + 0x58433c5 (0x7fddea6fd3c5 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default2]:frame #18: + 0x584ed35 (0x7f39522fad35 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default2]:frame #19: + 0xc97eee (0x7f3964baceee in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_python.so) [default0]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl [default2]: return self._call_impl(*args, **kwargs) [default1]:Exception raised from recvBytes at ../torch/csrc/distributed/c10d/Utils.hpp:670 (most 
recent call first): [default1]:frame #20: + 0x413ea4 (0x7fac555f6ea4 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_python.so) [default1]:frame #21: + 0x1445a6 (0x55d2ffb565a6 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:frame #9: c10d::ProcessGroupNCCL::broadcastUniqueNCCLID(ncclUniqueId*, bool, std::string const&, int) + 0xa9 (0x7fd3f3333c69 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default7]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl [default0]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 430, in train [default0]: outputs, loss_avg = self.training_step(dataloader=self.current_dataloader) [default2]:frame #26: PyObject_Call + 0xbc (0x55c235b5af1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:frame #20: + 0x413ea4 (0x7f3964328ea4 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_python.so) [default2]:frame #21: + 0x1445a6 (0x55be000de5a6 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl [default5]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 459, in training_step [default0]:frame #22: _PyObject_MakeTpCall + 0x26b (0x557fc7535a6b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:frame #9: c10d::ProcessGroupNCCL::broadcastUniqueNCCLID(ncclUniqueId*, bool, std::string const&, int) + 0xa9 (0x7f2f74fbdc69 in 
/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default0]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 890, in forward [default3]:frame #10: c10d::ProcessGroupNCCL::getNCCLComm(std::string const&, std::vector > const&, c10d::OpType, int, bool) + 0x22b (0x7ff4baee2c5b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default1]:frame #15: + 0x4e893cc (0x7fdde9d433cc in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default2]:frame #22: _PyObject_MakeTpCall + 0x26b (0x55be000d7a6b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]: output = model(**micro_batch) [default5]: outputs = self.pipeline_engine.train_batch_iter( [default6]: return self.forward_with_hidden_states(input_ids=input_ids, input_mask=input_mask)[0] [default0]:frame #23: + 0x150866 (0x557fc7548866 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:frame #10: c10d::ProcessGroupNCCL::getNCCLComm(std::string const&, std::vector > const&, c10d::OpType, int, bool) + 0x22b (0x7fd3f333ac5b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default2]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl [default5]: sharded_logits = self.model( [default2]:frame #27: _PyEval_EvalFrameDefault + 0x2d83 (0x55c235b412b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:frame #28: _PyFunction_Vectorcall + 0x6c (0x55c235b4ea2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:frame #29: _PyEval_EvalFrameDefault + 0x13ca (0x55c235b3f8fa in 
/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:frame #18: + 0x584ed35 (0x7f212e5d1d35 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default2]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl [default6]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 780, in forward_with_hidden_states [default0]:frame #24: _PyEval_EvalFrameDefault + 0x4c12 (0x557fc7531142 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:frame #25: _PyFunction_Vectorcall + 0x6c (0x557fc753ca2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:frame #11: c10d::ProcessGroupNCCL::recv(std::vector >&, int, int) + 0x550 (0x7fd3f335db60 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default6]:frame #10: c10d::ProcessGroupNCCL::getNCCLComm(std::string const&, std::vector > const&, c10d::OpType, int, bool) + 0x22b (0x7f2f74fc4c5b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default6]:frame #11: c10d::ProcessGroupNCCL::recv(std::vector >&, int, int) + 0x550 (0x7f2f74fe7b60 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default7]: return self._call_impl(*args, **kwargs) [default0]: sharded_logits = self.model( [default4]: return self.forward_with_hidden_states(input_ids=input_ids, input_mask=input_mask)[0] [default6]: outputs = self.pipeline_engine.train_batch_iter( [default2]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 430, in train [default0]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 459, in training_step 
[default2]:frame #30: + 0x150582 (0x55c235b5a582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:frame #23: + 0x150866 (0x55be000ea866 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]: return self._call_impl(*args, **kwargs) [default3]: sharded_logits = self.model( [default0]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl [default0]: return self._call_impl(*args, **kwargs) [default1]:frame #22: _PyObject_MakeTpCall + 0x26b (0x55d2ffb4fa6b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:frame #23: + 0x150866 (0x55d2ffb62866 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:frame #24: _PyEval_EvalFrameDefault + 0x4c12 (0x55d2ffb4b142 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:frame #12: + 0x5838439 (0x7f2fabd72439 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default2]: return self._call_impl(*args, **kwargs) [default1]:frame #21: + 0x1445a6 (0x5577f18e05a6 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:frame #22: _PyObject_MakeTpCall + 0x26b (0x5577f18d9a6b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:frame #16: + 0x1a08a88 (0x7fdde68c2a88 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default1]:frame #19: + 0xc97eee (0x7f2140e83eee in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_python.so) [default5]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl [default0]: return forward_call(*args, **kwargs) [default1]:frame #0: 
c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7fe8ee4bbd87 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default0]:frame #26: PyObject_Call + 0xbc (0x557fc7548f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:frame #12: + 0x5838439 (0x7fd42a0e8439 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default3]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/functional.py", line 117, in recv_from_pipeline_state_buffer [default6]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 187, in train_batch_iter [default0]: outputs = self.pipeline_engine.train_batch_iter( [default5]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl [default2]: outputs, loss_avg = self.training_step(dataloader=self.current_dataloader) [default1]:frame #17: + 0x5849a84 (0x7fddea703a84 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default2]:frame #31: _PyEval_EvalFrameDefault + 0x13ca (0x55c235b3f8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:frame #24: _PyEval_EvalFrameDefault + 0x4c12 (0x55be000d3142 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl [default3]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl [default3]: return self._call_impl(*args, **kwargs) [default0]:frame #27: _PyEval_EvalFrameDefault + 0x2d83 (0x557fc752f2b3 in 
/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:frame #25: _PyFunction_Vectorcall + 0x6c (0x55d2ffb56a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:frame #26: PyObject_Call + 0xbc (0x55d2ffb62f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:frame #13: + 0x5843330 (0x7fd42a0f3330 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default0]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl [default6]: output = self.forward(context=context, state=state, micro_batch=micro_batch, model=model) [default2]:frame #32: + 0x150582 (0x55c235b5a582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:frame #25: _PyFunction_Vectorcall + 0x6c (0x55be000dea2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:frame #20: + 0x413ea4 (0x7f21405ffea4 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_python.so) [default5]: return self._call_impl(*args, **kwargs) [default4]: dist.recv( [default4]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 72, in wrapper [default0]:frame #28: _PyFunction_Vectorcall + 0x6c (0x557fc753ca2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:frame #29: _PyEval_EvalFrameDefault + 0x13ca (0x557fc752d8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:frame #30: + 0x150582 (0x557fc7548582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:frame #27: _PyEval_EvalFrameDefault + 0x2d83 (0x55d2ffb492b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:frame #14: + 0x58433c5 
(0x7fd42a0f33c5 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default0]: return self._call_impl(*args, **kwargs) [default6]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 44, in forward [default5]: return self._call_impl(*args, **kwargs) [default1]:frame #18: + 0x584ed35 (0x7fddea708d35 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default1]:frame #21: + 0x1445a6 (0x55bd4bc9c5a6 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:frame #22: _PyObject_MakeTpCall + 0x26b (0x55bd4bc95a6b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]: return forward_call(*args, **kwargs) [default3]: return forward_call(*args, **kwargs) [default0]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 764, in forward [default0]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl [default0]:frame #31: _PyEval_EvalFrameDefault + 0x13ca (0x557fc752d8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:frame #15: + 0x4e893cc (0x7fd4297393cc in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default2]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl [default6]: output = model(**micro_batch) [default2]:frame #33: _PyEval_EvalFrameDefault + 0x13ca (0x55c235b3f8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:frame #23: + 0x150866 (0x55bd4bca8866 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:frame #24: _PyEval_EvalFrameDefault + 0x4c12 
(0x55bd4bc91142 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 764, in forward [default3]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl [default1]:frame #28: _PyFunction_Vectorcall + 0x6c (0x55d2ffb56a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:frame #29: _PyEval_EvalFrameDefault + 0x13ca (0x55d2ffb478fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:frame #16: + 0x1a08a88 (0x7fd4262b8a88 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default6]:frame #13: + 0x5843330 (0x7f2fabd7d330 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default2]: return forward_call(*args, **kwargs) [default4]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 780, in forward_with_hidden_states [default3]:frame #11: c10d::ProcessGroupNCCL::recv(std::vector >&, int, int) + 0x550 (0x7ff4baf05b60 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default1]:frame #19: + 0xc97eee (0x7fddfcfbaeee in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_python.so) [default1]:frame #25: _PyFunction_Vectorcall + 0x6c (0x55bd4bc9ca2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 890, in forward [default5]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 187, in train_batch_iter [default1]:frame #1: + 0x589518e 
(0x7fe92647518e in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default4]: return func(*args, **kwargs) [default1]:frame #30: + 0x150582 (0x55d2ffb62582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:frame #14: + 0x58433c5 (0x7f2fabd7d3c5 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default6]:frame #15: + 0x4e893cc (0x7f2fab3c33cc in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default7]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl [default7]: return forward_call(*args, **kwargs) [default5]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl [default6]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl [default2]:frame #34: + 0x150582 (0x55c235b5a582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:frame #26: PyObject_Call + 0xbc (0x55be000eaf1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]: return self.forward_with_hidden_states(input_ids=input_ids, input_mask=input_mask)[0] [default0]: return self.forward_with_hidden_states(input_ids=input_ids, input_mask=input_mask)[0] [default5]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl [default3]: sharded_logits = self.model( [default2]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 459, in training_step [default0]:frame #32: + 0x150582 (0x557fc7548582 in 
/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:frame #17: + 0x5849a84 (0x7fd42a0f9a84 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default3]: pipeline_state.run_communication() [default6]: return self._call_impl(*args, **kwargs) [default2]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 459, in training_step [default1]:frame #20: + 0x413ea4 (0x7fddfc736ea4 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_python.so) [default1]:frame #26: PyObject_Call + 0xbc (0x55bd4bca8f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 780, in forward_with_hidden_states [default3]: return forward_call(*args, **kwargs) [default3]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 764, in forward [default1]:frame #31: _PyEval_EvalFrameDefault + 0x13ca (0x55d2ffb478fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:frame #32: + 0x150582 (0x55d2ffb62582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:Traceback (most recent call last): [default7]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 237, in [default5]:frame #18: + 0x584ed35 (0x7fd42a0fed35 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default3]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 150, in run_communication [default4]: hidden_encoder_states = encoder_block(**hidden_encoder_states) [default2]: outputs = self.pipeline_engine.train_batch_iter( [default1]:frame #21: + 0x1445a6 (0x5638286e05a6 in 
/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:frame #35: _PyEval_EvalFrameDefault + 0x13ca (0x55c235b3f8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:frame #27: _PyEval_EvalFrameDefault + 0x2d83 (0x55bd4bc8f2b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:frame #27: _PyEval_EvalFrameDefault + 0x2d83 (0x55be000d12b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:frame #28: _PyFunction_Vectorcall + 0x6c (0x55be000dea2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]: return forward_call(*args, **kwargs) [default4]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1706, in recv [default1]:frame #33: _PyEval_EvalFrameDefault + 0x13ca (0x55d2ffb478fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:frame #34: + 0x150582 (0x55d2ffb62582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:frame #35: _PyEval_EvalFrameDefault + 0x13ca (0x55d2ffb478fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:frame #33: _PyEval_EvalFrameDefault + 0x13ca (0x557fc752d8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]: trainer.train(dataloader) [default5]:frame #19: + 0xc97eee (0x7fd43c9b0eee in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_python.so) [default0]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl [default5]: return forward_call(*args, **kwargs) [default1]:frame #22: _PyObject_MakeTpCall + 0x26b (0x5638286d9a6b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:frame #29: _PyEval_EvalFrameDefault + 0x13ca 
(0x55be000cf8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl [default2]: outputs = self.pipeline_engine.train_batch_iter( [default1]:frame #36: _PyObject_FastCallDictTstate + 0xd0 (0x55d2ffb4ef50 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 430, in train [default7]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 890, in forward [default6]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl [default4]: outputs, loss_avg = self.training_step(dataloader=self.current_dataloader) [default2]:frame #36: _PyObject_FastCallDictTstate + 0xd0 (0x55c235b46f50 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:frame #28: _PyFunction_Vectorcall + 0x6c (0x55bd4bc9ca2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 780, in forward_with_hidden_states [default2]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 187, in train_batch_iter [default3]: return self.forward_with_hidden_states(input_ids=input_ids, input_mask=input_mask)[0] [default6]: hidden_encoder_states = encoder_block(**hidden_encoder_states) [default0]:frame #34: + 0x150582 (0x557fc7548582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:frame #20: + 0x413ea4 (0x7fd43c12cea4 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_python.so) [default5]:frame #21: + 
0x1445a6 (0x5648512ec5a6 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]: sharded_logits = self.model( [default0]: return forward_call(*args, **kwargs) [default0]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 764, in forward [default4]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 459, in training_step [default0]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 187, in train_batch_iter [default5]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 764, in forward [default5]: return self.forward_with_hidden_states(input_ids=input_ids, input_mask=input_mask)[0] [default1]:frame #23: + 0x150866 (0x5638286ec866 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:frame #24: _PyEval_EvalFrameDefault + 0x4c12 (0x5638286d5142 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:frame #29: _PyEval_EvalFrameDefault + 0x13ca (0x55bd4bc8d8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]: hidden_encoder_states = encoder_block(**hidden_encoder_states) [default3]: return self._call_impl(*args, **kwargs) [default6]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl [default4]: pg.recv([tensor], group_src_rank, tag).wait() [default1]:frame #37: _PyObject_Call_Prepend + 0x69 (0x55d2ffb60c39 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]: outputs, loss_avg = self.training_step(dataloader=self.current_dataloader) [default2]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/block.py", line 126, in forward [default6]: return forward_call(*args, **kwargs) [default6]: File 
"/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 890, in forward [default2]:frame #37: _PyObject_Call_Prepend + 0x69 (0x55c235b58c39 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:frame #30: + 0x150582 (0x55bd4bca8582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:frame #31: _PyEval_EvalFrameDefault + 0x13ca (0x55bd4bc8d8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]: hidden_encoder_states = encoder_block(**hidden_encoder_states) [default2]: output = self.forward(context=context, state=state, micro_batch=micro_batch, model=model) [default3]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 780, in forward_with_hidden_states [default1]:frame #38: + 0x211239 (0x55d2ffc23239 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:frame #39: _PyObject_MakeTpCall + 0x26b (0x55d2ffb4fa6b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:frame #22: _PyObject_MakeTpCall + 0x26b (0x5648512e5a6b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]: new_kwargs[name] = recv_from_pipeline_state_buffer( [default3]: recv_activation_tensor = recv_activation() [default3]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 31, in __call__ [default1]:frame #23: + 0x150866 (0x5577f18ec866 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 780, in forward_with_hidden_states [default1]:frame #25: _PyFunction_Vectorcall + 0x6c (0x5638286e0a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:frame #30: + 0x150582 (0x55be000ea582 in 
/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:frame #32: + 0x150582 (0x55bd4bca8582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 890, in forward [default3]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl [default3]: hidden_encoder_states = encoder_block(**hidden_encoder_states) [default1]:frame #40: _PyEval_EvalFrameDefault + 0x4eb6 (0x55d2ffb4b3e6 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:frame #41: _PyFunction_Vectorcall + 0x6c (0x55d2ffb56a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 459, in training_step [default4]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl [default4]: return self._call_impl(*args, **kwargs) [default5]: hidden_encoder_states = encoder_block(**hidden_encoder_states) [default5]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl [default1]:frame #26: PyObject_Call + 0xbc (0x5638286ecf1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:frame #38: + 0x211239 (0x55c235c1b239 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:frame #33: _PyEval_EvalFrameDefault + 0x13ca (0x55bd4bc8d8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:frame #31: _PyEval_EvalFrameDefault + 0x13ca (0x55be000cf8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:frame #34: + 0x150582 (0x55bd4bca8582 in 
/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:frame #35: _PyEval_EvalFrameDefault + 0x13ca (0x55bd4bc8d8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl [default2]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 44, in forward [default1]:frame #42: _PyEval_EvalFrameDefault + 0x72c (0x55d2ffb46c5c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:frame #43: _PyFunction_Vectorcall + 0x6c (0x55d2ffb56a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:frame #23: + 0x150866 (0x5648512f8866 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:frame #24: _PyEval_EvalFrameDefault + 0x4c12 (0x5648512e1142 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:frame #25: _PyFunction_Vectorcall + 0x6c (0x5648512eca2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]: return self.forward_with_hidden_states(input_ids=input_ids, input_mask=input_mask)[0] [default1]:frame #24: _PyEval_EvalFrameDefault + 0x4c12 (0x5577f18d5142 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:Traceback (most recent call last): [default5]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 237, in [default1]:frame #27: _PyEval_EvalFrameDefault + 0x2d83 (0x5638286d32b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:frame #28: _PyFunction_Vectorcall + 0x6c (0x5638286e0a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:frame #32: + 0x150582 (0x55be000ea582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]: 
return forward_call(*args, **kwargs) [default3]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 764, in forward [default5]: output = self.forward(context=context, state=state, micro_batch=micro_batch, model=model) [default0]:frame #35: _PyEval_EvalFrameDefault + 0x13ca (0x557fc752d8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:frame #36: _PyObject_FastCallDictTstate + 0xd0 (0x557fc7534f50 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:frame #16: + 0x1a08a88 (0x7f2fa7f42a88 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default7]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl [default1]:frame #25: _PyFunction_Vectorcall + 0x6c (0x5577f18e0a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:frame #26: PyObject_Call + 0xbc (0x5577f18ecf1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]: trainer.train(dataloader) [default2]:frame #39: _PyObject_MakeTpCall + 0x26b (0x55c235b47a6b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:frame #40: _PyEval_EvalFrameDefault + 0x4eb6 (0x55c235b433e6 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:frame #29: _PyEval_EvalFrameDefault + 0x13ca (0x5638286d18fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:frame #33: _PyEval_EvalFrameDefault + 0x13ca (0x55be000cf8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:frame #36: _PyObject_FastCallDictTstate + 0xd0 (0x55bd4bc94f50 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:frame #37: _PyObject_Call_Prepend + 0x69 (0x55bd4bca6c39 in 
/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]: return self._call_impl(*args, **kwargs) [default0]: return forward_call(*args, **kwargs) [default0]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/block.py", line 126, in forward [default0]:frame #37: _PyObject_Call_Prepend + 0x69 (0x557fc7546c39 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:frame #44: _PyEval_EvalFrameDefault + 0x13ca (0x55d2ffb478fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:frame #45: + 0x150582 (0x55d2ffb62582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:frame #26: PyObject_Call + 0xbc (0x5648512f8f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/functional.py", line 117, in recv_from_pipeline_state_buffer [default4]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl [default5]: return self._call_impl(*args, **kwargs) [default1]:frame #30: + 0x150582 (0x5638286ec582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:frame #41: _PyFunction_Vectorcall + 0x6c (0x55c235b4ea2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:frame #31: _PyEval_EvalFrameDefault + 0x13ca (0x5638286d18fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:frame #38: + 0x211239 (0x55bd4bd69239 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]: return self.forward_with_hidden_states(input_ids=input_ids, input_mask=input_mask)[0] [default0]: new_kwargs[name] = recv_from_pipeline_state_buffer( [default0]:frame #38: + 0x211239 (0x557fc7609239 in 
/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:frame #17: + 0x5849a84 (0x7f2fabd83a84 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default3]: return self.p2p.recv_tensors(num_tensors=1, from_rank=self.from_rank)[0] [default5]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl [default2]:frame #42: _PyEval_EvalFrameDefault + 0x72c (0x55c235b3ec5c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:frame #34: + 0x150582 (0x55be000ea582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:frame #35: _PyEval_EvalFrameDefault + 0x13ca (0x55be000cf8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]: sharded_logits = self.model( [default2]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl [default5]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 44, in forward [default1]:frame #46: PyObject_Call + 0xbc (0x55d2ffb62f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:frame #47: _PyEval_EvalFrameDefault + 0x2d83 (0x55d2ffb492b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:frame #48: + 0x150582 (0x55d2ffb62582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:frame #49: PyObject_Call + 0xbc (0x55d2ffb62f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:frame #27: _PyEval_EvalFrameDefault + 0x2d83 (0x5648512df2b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]: pipeline_state.run_communication() [default3]:frame #12: + 0x5838439 (0x7ff4f1c90439 in 
/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default3]:frame #13: + 0x5843330 (0x7ff4f1c9b330 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default1]:frame #32: + 0x150582 (0x5638286ec582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:frame #33: _PyEval_EvalFrameDefault + 0x13ca (0x5638286d18fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:frame #39: _PyObject_MakeTpCall + 0x26b (0x55bd4bc95a6b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 780, in forward_with_hidden_states [default6]: return self._call_impl(*args, **kwargs) [default1]:frame #2: c10d::TCPStore::doWait(c10::ArrayRef, std::chrono::duration >) + 0x360 (0x7fe92646f9a0 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default4]:torch.distributed.DistBackendError: [7] is setting up NCCL communicator and retrieving ncclUniqueId from [0] via c10d key-value store by key '6:7', but store->get('6:7') got error: Connection reset by peer [default4]:Exception raised from recvBytes at ../torch/csrc/distributed/c10d/Utils.hpp:670 (most recent call first): [default1]:frame #50: _PyEval_EvalFrameDefault + 0x2d83 (0x55d2ffb492b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:frame #28: _PyFunction_Vectorcall + 0x6c (0x5648512eca2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:frame #29: _PyEval_EvalFrameDefault + 0x13ca (0x5648512dd8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 150, in run_communication 
[default3]:frame #14: + 0x58433c5 (0x7ff4f1c9b3c5 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default2]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 187, in train_batch_iter [default6]: sharded_logits = self.model( [default2]:frame #43: _PyFunction_Vectorcall + 0x6c (0x55c235b4ea2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:frame #44: _PyEval_EvalFrameDefault + 0x13ca (0x55c235b3f8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:frame #36: _PyObject_FastCallDictTstate + 0xd0 (0x55be000d6f50 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl [default4]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f4d5457dd87 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default2]: output = model(**micro_batch) [default1]:frame #51: _PyFunction_Vectorcall + 0x6c (0x55d2ffb56a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:frame #52: _PyObject_FastCallDictTstate + 0x187 (0x55d2ffb4f007 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:frame #53: _PyObject_Call_Prepend + 0x69 (0x55d2ffb60c39 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:frame #18: + 0x584ed35 (0x7f2fabd88d35 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default7]: return self._call_impl(*args, **kwargs) [default4]: return forward_call(*args, **kwargs) [default3]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 
353, in recv_tensors [default4]: outputs = self.pipeline_engine.train_batch_iter( [default1]:frame #34: + 0x150582 (0x5638286ec582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 430, in train [default1]:frame #40: _PyEval_EvalFrameDefault + 0x4eb6 (0x55bd4bc913e6 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl [default3]: hidden_encoder_states = encoder_block(**hidden_encoder_states) [default3]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl [default2]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl [default3]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl [default3]: return self._call_impl(*args, **kwargs) [default0]:frame #39: _PyObject_MakeTpCall + 0x26b (0x557fc7535a6b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:frame #30: + 0x150582 (0x5648512f8582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:frame #31: _PyEval_EvalFrameDefault + 0x13ca (0x5648512dd8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]: buffers, futures = self.irecv_tensors(num_tensors=num_tensors, from_rank=from_rank, tag=tag) [default1]:frame #27: _PyEval_EvalFrameDefault + 0x2d83 (0x5577f18d32b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:frame #35: _PyEval_EvalFrameDefault + 0x13ca (0x5638286d18fa in 
/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:frame #36: _PyObject_FastCallDictTstate + 0xd0 (0x5638286d8f50 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:frame #37: _PyObject_Call_Prepend + 0x69 (0x5638286eac39 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:frame #37: _PyObject_Call_Prepend + 0x69 (0x55be000e8c39 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:frame #38: + 0x211239 (0x55be001ab239 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]: return forward_call(*args, **kwargs) [default5]: return self._call_impl(*args, **kwargs) [default3]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl [default1]:frame #54: + 0x211239 (0x55d2ffc23239 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]: outputs = self.pipeline_engine.train_batch_iter( [default7]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl [default7]:Traceback (most recent call last): [default1]:frame #38: + 0x211239 (0x5638287ad239 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:frame #39: _PyObject_MakeTpCall + 0x26b (0x55be000d7a6b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]: return self._call_impl(*args, **kwargs) [default1]:frame #3: c10d::TCPStore::doGet(std::string const&) + 0x32 (0x7fe92646fce2 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default0]:frame #40: _PyEval_EvalFrameDefault + 0x4eb6 (0x557fc75313e6 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]: File 
"/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 187, in train_batch_iter [default5]:frame #32: + 0x150582 (0x5648512f8582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:frame #19: + 0xc97eee (0x7f2fbe63aeee in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_python.so) [default2]: recv_activation_tensor = recv_activation() [default0]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 780, in forward_with_hidden_states [default7]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 237, in [default2]:frame #45: + 0x150582 (0x55c235b5a582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]: outputs, loss_avg = self.training_step(dataloader=self.current_dataloader) [default5]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 459, in training_step [default1]:frame #41: _PyFunction_Vectorcall + 0x6c (0x55bd4bc9ca2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/block.py", line 126, in forward [default2]: return self._call_impl(*args, **kwargs) [default0]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/functional.py", line 117, in recv_from_pipeline_state_buffer [default0]:frame #41: _PyFunction_Vectorcall + 0x6c (0x557fc753ca2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:frame #42: _PyEval_EvalFrameDefault + 0x72c (0x557fc752cc5c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:frame #20: + 0x413ea4 (0x7f2fbddb6ea4 in 
/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_python.so) [default7]: return forward_call(*args, **kwargs) [default0]: output = self.forward(context=context, state=state, micro_batch=micro_batch, model=model) [default4]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 187, in train_batch_iter [default1]:frame #39: _PyObject_MakeTpCall + 0x26b (0x5638286d9a6b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:frame #40: _PyEval_EvalFrameDefault + 0x4eb6 (0x5638286d53e6 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:frame #40: _PyEval_EvalFrameDefault + 0x4eb6 (0x55be000d33e6 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl [default5]: output = model(**micro_batch) [default1]:frame #55: _PyObject_MakeTpCall + 0x26b (0x55d2ffb4fa6b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:frame #43: _PyFunction_Vectorcall + 0x6c (0x557fc753ca2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:frame #44: _PyEval_EvalFrameDefault + 0x13ca (0x557fc752d8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]: output = self.forward(context=context, state=state, micro_batch=micro_batch, model=model) [default2]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 31, in __call__ [default3]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 326, in irecv_tensors [default2]: output = self.forward(context=context, state=state, micro_batch=micro_batch, model=model) [default5]: outputs = 
self.pipeline_engine.train_batch_iter( [default2]:frame #41: _PyFunction_Vectorcall + 0x6c (0x55be000dea2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:frame #42: _PyEval_EvalFrameDefault + 0x72c (0x55bd4bc8cc5c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]: new_kwargs[name] = recv_from_pipeline_state_buffer( [default5]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl [default1]:frame #4: c10d::TCPStore::get(std::string const&) + 0xa1 (0x7fe926470b11 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default1]:frame #56: _PyEval_EvalFrameDefault + 0x5723 (0x55d2ffb4bc53 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:frame #33: _PyEval_EvalFrameDefault + 0x13ca (0x5648512dd8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:frame #34: + 0x150582 (0x5648512f8582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]: hidden_encoder_states = encoder_block(**hidden_encoder_states) [default6]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl [default1]:frame #41: _PyFunction_Vectorcall + 0x6c (0x5638286e0a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:frame #42: _PyEval_EvalFrameDefault + 0x72c (0x55be000cec5c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]: return self._call_impl(*args, **kwargs) [default5]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl [default3]: return forward_call(*args, **kwargs) [default1]:frame #57: + 0x150582 (0x55d2ffb62582 in 
/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:frame #58: _PyEval_EvalFrameDefault + 0x13ca (0x55d2ffb478fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 44, in forward [default7]: output = model(**micro_batch) [default7]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 764, in forward [default4]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/block.py", line 126, in forward [default3]: meta = self._recv_meta(from_rank=from_rank, tag=tag) [default3]:frame #15: + 0x4e893cc (0x7ff4f12e13cc in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default1]:frame #28: _PyFunction_Vectorcall + 0x6c (0x5577f18e0a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:frame #42: _PyEval_EvalFrameDefault + 0x72c (0x5638286d0c5c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:frame #43: _PyFunction_Vectorcall + 0x6c (0x55bd4bc9ca2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/functional.py", line 117, in recv_from_pipeline_state_buffer [default2]: pipeline_state.run_communication() [default0]: pipeline_state.run_communication() [default0]:frame #45: + 0x150582 (0x557fc7548582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:frame #35: _PyEval_EvalFrameDefault + 0x13ca (0x5648512dd8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 246, in _recv_meta [default2]: 
File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 44, in forward [default1]:frame #43: _PyFunction_Vectorcall + 0x6c (0x5638286e0a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 187, in train_batch_iter [default2]:frame #43: _PyFunction_Vectorcall + 0x6c (0x55be000dea2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 150, in run_communication [default5]: return forward_call(*args, **kwargs) [default3]: return forward_call(*args, **kwargs) [default5]: return self._call_impl(*args, **kwargs) [default1]:frame #59: + 0x150582 (0x55d2ffb62582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:frame #60: PyObject_Call + 0xbc (0x55d2ffb62f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl [default7]: return self.forward_with_hidden_states(input_ids=input_ids, input_mask=input_mask)[0] [default6]: return self._call_impl(*args, **kwargs) [default2]:frame #46: PyObject_Call + 0xbc (0x55c235b5af1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:frame #44: _PyEval_EvalFrameDefault + 0x13ca (0x55bd4bc8d8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:frame #45: + 0x150582 (0x55bd4bca8582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl [default6]: File 
"/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl [default1]:frame #61: _PyEval_EvalFrameDefault + 0x2d83 (0x55d2ffb492b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:frame #21: + 0x1445a6 (0x5619ab3975a6 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]: return self.p2p.recv_tensors(num_tensors=1, from_rank=self.from_rank)[0] [default3]: dist.recv( [default3]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 72, in wrapper [default0]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 44, in forward [default1]:frame #44: _PyEval_EvalFrameDefault + 0x13ca (0x5638286d18fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:frame #46: PyObject_Call + 0xbc (0x55bd4bca8f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 764, in forward [default4]:frame #1: + 0x589518e (0x7f4d8c53718e in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default0]:frame #46: PyObject_Call + 0xbc (0x557fc7548f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]: return self._call_impl(*args, **kwargs) [default7]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl [default2]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 353, in recv_tensors [default2]: buffers, futures = self.irecv_tensors(num_tensors=num_tensors, from_rank=from_rank, tag=tag) [default3]:frame #16: + 0x1a08a88 (0x7ff4ede60a88 
in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default5]: output = self.forward(context=context, state=state, micro_batch=micro_batch, model=model) [default1]:frame #47: _PyEval_EvalFrameDefault + 0x2d83 (0x55bd4bc8f2b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:frame #48: + 0x150582 (0x55bd4bca8582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]: return self.forward_with_hidden_states(input_ids=input_ids, input_mask=input_mask)[0] [default0]: return forward_call(*args, **kwargs) [default1]:frame #5: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7fe926425f81 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default1]:frame #62: + 0x150582 (0x55d2ffb62582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:frame #36: _PyObject_FastCallDictTstate + 0xd0 (0x5648512e4f50 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:frame #22: _PyObject_MakeTpCall + 0x26b (0x5619ab390a6b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]: return forward_call(*args, **kwargs) [default2]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 326, in irecv_tensors [default3]: return func(*args, **kwargs) [default7]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 780, in forward_with_hidden_states [default7]: hidden_encoder_states = encoder_block(**hidden_encoder_states) [default3]:frame #17: + 0x5849a84 (0x7ff4f1ca1a84 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default2]:frame #47: _PyEval_EvalFrameDefault + 0x2d83 (0x55c235b412b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) 
[default1]:frame #49: PyObject_Call + 0xbc (0x55bd4bca8f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:frame #44: _PyEval_EvalFrameDefault + 0x13ca (0x55be000cf8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/block.py", line 126, in forward [default5]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl [default3]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/block.py", line 126, in forward [default0]:frame #47: _PyEval_EvalFrameDefault + 0x2d83 (0x557fc752f2b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:frame #23: + 0x150866 (0x5619ab3a3866 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:frame #37: _PyObject_Call_Prepend + 0x69 (0x5648512f6c39 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:frame #38: + 0x211239 (0x5648513b9239 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]: new_kwargs[name] = recv_from_pipeline_state_buffer( [default2]: meta = self._recv_meta(from_rank=from_rank, tag=tag) [default4]: output = self.forward(context=context, state=state, micro_batch=micro_batch, model=model) [default4]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 44, in forward [default5]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 44, in forward [default5]: output = model(**micro_batch) [default1]:frame #50: _PyEval_EvalFrameDefault + 0x2d83 (0x55bd4bc8f2b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]: File 
"/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/block.py", line 126, in forward [default4]:frame #2: c10d::TCPStore::doWait(c10::ArrayRef, std::chrono::duration >) + 0x360 (0x7f4d8c5319a0 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default1]:frame #63: PyObject_Call + 0xbc (0x55d2ffb62f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:frame #39: _PyObject_MakeTpCall + 0x26b (0x5648512e5a6b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl [default5]: return forward_call(*args, **kwargs) [default5]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/block.py", line 126, in forward [default1]:frame #45: + 0x150582 (0x5638286ec582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:frame #45: + 0x150582 (0x55be000ea582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:frame #46: PyObject_Call + 0xbc (0x55be000eaf1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:frame #51: _PyFunction_Vectorcall + 0x6c (0x55bd4bc9ca2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]: new_kwargs[name] = recv_from_pipeline_state_buffer( [default3]: new_kwargs[name] = recv_from_pipeline_state_buffer( [default6]: return forward_call(*args, **kwargs) [default0]:frame #48: + 0x150582 (0x557fc7548582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:frame #40: _PyEval_EvalFrameDefault + 0x4eb6 (0x5648512e13e6 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]: return self._call_impl(*args, **kwargs) [default7]: File 
"/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl [default6]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl [default1]:frame #46: PyObject_Call + 0xbc (0x5638286ecf1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:frame #47: _PyEval_EvalFrameDefault + 0x2d83 (0x55be000d12b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 780, in forward_with_hidden_states [default0]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 150, in run_communication [default1]:. This may indicate a possible application crash on rank 0 or a network set up issue. [default6]:frame #24: _PyEval_EvalFrameDefault + 0x4c12 (0x5619ab38c142 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 890, in forward [default7]: return self._call_impl(*args, **kwargs) [default1]:frame #29: _PyEval_EvalFrameDefault + 0x13ca (0x5577f18d18fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:frame #47: _PyEval_EvalFrameDefault + 0x2d83 (0x5638286d32b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:frame #48: + 0x150582 (0x55c235b5a582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:frame #48: + 0x150582 (0x5638286ec582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:frame #48: + 0x150582 (0x55be000ea582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:frame #52: _PyObject_FastCallDictTstate + 0x187 
(0x55bd4bc95007 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/functional.py", line 117, in recv_from_pipeline_state_buffer [default2]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl [default1]:frame #6: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7fe926425f81 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default5]: return forward_call(*args, **kwargs) [default0]:frame #49: PyObject_Call + 0xbc (0x557fc7548f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:frame #41: _PyFunction_Vectorcall + 0x6c (0x5648512eca2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/functional.py", line 117, in recv_from_pipeline_state_buffer [default7]: trainer.train(dataloader) [default3]:frame #18: + 0x584ed35 (0x7ff4f1ca6d35 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default5]: new_kwargs[name] = recv_from_pipeline_state_buffer( [default1]:frame #49: PyObject_Call + 0xbc (0x5638286ecf1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl [default1]:frame #53: _PyObject_Call_Prepend + 0x69 (0x55bd4bca6c39 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]: hidden_encoder_states = encoder_block(**hidden_encoder_states) [default0]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/functional.py", 
line 117, in recv_from_pipeline_state_buffer [default5]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 890, in forward [default0]:frame #50: _PyEval_EvalFrameDefault + 0x2d83 (0x557fc752f2b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:frame #42: _PyEval_EvalFrameDefault + 0x72c (0x5648512dcc5c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]: sharded_logits = self.model( [default3]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1706, in recv [default4]: output = model(**micro_batch) [default2]:frame #49: PyObject_Call + 0xbc (0x55c235b5af1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:frame #54: + 0x211239 (0x55bd4bd69239 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:frame #55: PyObject_Call + 0x207 (0x55bd4bca9067 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]: pipeline_state.run_communication() [default5]: sharded_logits = self.model( [default2]: return forward_call(*args, **kwargs) [default0]:frame #51: _PyFunction_Vectorcall + 0x6c (0x557fc753ca2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:frame #52: _PyObject_FastCallDictTstate + 0x187 (0x557fc7535007 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:frame #53: _PyObject_Call_Prepend + 0x69 (0x557fc7546c39 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:frame #25: _PyFunction_Vectorcall + 0x6c (0x5619ab397a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:frame #26: PyObject_Call + 0xbc (0x5619ab3a3f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:frame #27: _PyEval_EvalFrameDefault + 0x2d83 (0x5619ab38a2b3 in 
/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]: pg.recv([tensor], group_src_rank, tag).wait() [default0]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl [default1]:frame #30: + 0x150582 (0x5577f18ec582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]: return self._call_impl(*args, **kwargs) [default1]:frame #56: _PyEval_EvalFrameDefault + 0x2d83 (0x55bd4bc8f2b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:frame #57: + 0x150582 (0x55bd4bca8582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]: recv_activation_tensor = recv_activation() [default0]: pipeline_state.run_communication() [default4]:frame #3: c10d::TCPStore::doGet(std::string const&) + 0x32 (0x7f4d8c531ce2 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default4]:frame #4: c10d::TCPStore::get(std::string const&) + 0xa1 (0x7f4d8c532b11 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default0]:frame #54: + 0x211239 (0x557fc7609239 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:frame #55: _PyObject_MakeTpCall + 0x26b (0x557fc7535a6b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:frame #56: _PyEval_EvalFrameDefault + 0x5723 (0x557fc7531c53 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:frame #28: _PyFunction_Vectorcall + 0x6c (0x5619ab397a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]: return forward_call(*args, **kwargs) [default7]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 430, in train [default1]:frame #50: _PyEval_EvalFrameDefault + 0x2d83 (0x5638286d32b3 in 
/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:frame #49: PyObject_Call + 0xbc (0x55be000eaf1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 150, in run_communication [default6]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/block.py", line 126, in forward [default0]: recv_activation_tensor = recv_activation() [default0]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 31, in __call__ [default0]:frame #57: + 0x150582 (0x557fc7548582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:frame #58: _PyEval_EvalFrameDefault + 0x13ca (0x557fc752d8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:frame #59: + 0x150582 (0x557fc7548582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:frame #60: PyObject_Call + 0xbc (0x557fc7548f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:frame #43: _PyFunction_Vectorcall + 0x6c (0x5648512eca2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 246, in _recv_meta [default0]: output = model(**micro_batch) [default2]:frame #50: _PyEval_EvalFrameDefault + 0x2d83 (0x55c235b412b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:frame #50: _PyEval_EvalFrameDefault + 0x2d83 (0x55be000d12b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:frame #51: _PyFunction_Vectorcall + 0x6c (0x55be000dea2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]: File 
"/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl [default5]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl [default0]:frame #61: _PyEval_EvalFrameDefault + 0x2d83 (0x557fc752f2b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:frame #62: + 0x150582 (0x557fc7548582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:frame #63: PyObject_Call + 0xbc (0x557fc7548f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:frame #29: _PyEval_EvalFrameDefault + 0x13ca (0x5619ab3888fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl [default7]: return self._call_impl(*args, **kwargs) [default2]: dist.recv( [default2]: output = model(**micro_batch) [default2]:frame #51: _PyFunction_Vectorcall + 0x6c (0x55c235b4ea2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:frame #52: _PyObject_FastCallDictTstate + 0x187 (0x55be000d7007 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 31, in __call__ [default6]: new_kwargs[name] = recv_from_pipeline_state_buffer( [default0]:. This may indicate a possible application crash on rank 0 or a network set up issue. 
[default6]:frame #30: + 0x150582 (0x5619ab3a3582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:frame #44: _PyEval_EvalFrameDefault + 0x13ca (0x5648512dd8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl [default7]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl [default1]:frame #31: _PyEval_EvalFrameDefault + 0x13ca (0x5577f18d18fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:frame #52: _PyObject_FastCallDictTstate + 0x187 (0x55c235b47007 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl [default5]: return forward_call(*args, **kwargs) [default1]:frame #58: _PyEval_EvalFrameDefault + 0x13ca (0x55bd4bc8d8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:frame #59: + 0x150582 (0x55bd4bca8582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]: return self._call_impl(*args, **kwargs) [default3]: new_kwargs[name] = recv_from_pipeline_state_buffer( [default7]:Traceback (most recent call last): [default5]:frame #45: + 0x150582 (0x5648512f8582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:frame #46: PyObject_Call + 0xbc (0x5648512f8f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:frame #47: _PyEval_EvalFrameDefault + 0x2d83 (0x5648512df2b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]: return forward_call(*args, **kwargs) [default0]: File 
"/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/block.py", line 126, in forward [default6]: return forward_call(*args, **kwargs) [default1]:frame #51: _PyFunction_Vectorcall + 0x6c (0x5638286e0a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 890, in forward [default1]:frame #60: PyObject_Call + 0xbc (0x55bd4bca8f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]: return self.p2p.recv_tensors(num_tensors=1, from_rank=self.from_rank)[0] [default3]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/functional.py", line 117, in recv_from_pipeline_state_buffer [default5]: return self._call_impl(*args, **kwargs) [default7]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 237, in [default5]:frame #48: + 0x150582 (0x5648512f8582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]: new_kwargs[name] = recv_from_pipeline_state_buffer( [default6]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 764, in forward [default4]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl [default2]:frame #53: _PyObject_Call_Prepend + 0x69 (0x55c235b58c39 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:frame #52: _PyObject_FastCallDictTstate + 0x187 (0x5638286d9007 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:frame #53: _PyObject_Call_Prepend + 0x69 (0x55be000e8c39 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 
150, in run_communication [default4]:frame #5: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f4d8c4e7f81 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default4]:frame #6: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f4d8c4e7f81 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default7]: trainer.train(dataloader) [default7]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 430, in train [default7]: return forward_call(*args, **kwargs) [default4]: pipeline_state.run_communication() [default3]:frame #19: + 0xc97eee (0x7ff504558eee in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_python.so) [default3]:frame #20: + 0x413ea4 (0x7ff503cd4ea4 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_python.so) [default1]:frame #53: _PyObject_Call_Prepend + 0x69 (0x5638286eac39 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:frame #54: + 0x211239 (0x5638287ad239 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:frame #54: + 0x211239 (0x55be001ab239 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:frame #55: PyObject_Call + 0x207 (0x55be000eb067 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl [default2]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 890, in forward [default2]: sharded_logits = self.model( [default7]: outputs, loss_avg = self.training_step(dataloader=self.current_dataloader) [default2]:Traceback (most recent call last): [default5]:frame #49: PyObject_Call + 
0xbc (0x5648512f8f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 72, in wrapper [default3]:torch.distributed.DistBackendError: [4] is setting up NCCL communicator and retrieving ncclUniqueId from [0] via c10d key-value store by key '3:4', but store->get('3:4') got error: Connection reset by peer [default5]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/functional.py", line 117, in recv_from_pipeline_state_buffer [default6]: return self.forward_with_hidden_states(input_ids=input_ids, input_mask=input_mask)[0] [default1]:frame #55: PyObject_Call + 0x207 (0x5638286ed067 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]: sharded_logits = self.model( [default5]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl [default2]:frame #56: _PyEval_EvalFrameDefault + 0x2d83 (0x55be000d12b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 353, in recv_tensors [default3]: recv_activation_tensor = recv_activation() [default0]: return self.p2p.recv_tensors(num_tensors=1, from_rank=self.from_rank)[0] [default0]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 353, in recv_tensors [default2]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 237, in [default2]: trainer.train(dataloader) [default7]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 764, in forward [default4]: File 
"/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 150, in run_communication [default7]: outputs, loss_avg = self.training_step(dataloader=self.current_dataloader) [default0]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl [default1]:frame #56: _PyEval_EvalFrameDefault + 0x2d83 (0x5638286d32b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:frame #54: + 0x211239 (0x55c235c1b239 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:frame #61: _PyEval_EvalFrameDefault + 0x2d83 (0x55bd4bc8f2b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]: return forward_call(*args, **kwargs) [default0]: buffers, futures = self.irecv_tensors(num_tensors=num_tensors, from_rank=from_rank, tag=tag) [default6]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/functional.py", line 117, in recv_from_pipeline_state_buffer [default2]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl [default2]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 430, in train [default2]: outputs, loss_avg = self.training_step(dataloader=self.current_dataloader) [default5]:frame #50: _PyEval_EvalFrameDefault + 0x2d83 (0x5648512df2b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:frame #51: _PyFunction_Vectorcall + 0x6c (0x5648512eca2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]: return func(*args, **kwargs) [default2]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl [default5]: return 
self._call_impl(*args, **kwargs) [default5]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl [default2]:frame #57: + 0x150582 (0x55be000ea582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]: buffers, futures = self.irecv_tensors(num_tensors=num_tensors, from_rank=from_rank, tag=tag) [default1]:frame #7: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7fe926425f81 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default1]:frame #8: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7fe926425f81 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default2]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 459, in training_step [default7]: return self.forward_with_hidden_states(input_ids=input_ids, input_mask=input_mask)[0] [default7]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/block.py", line 126, in forward [default4]: recv_activation_tensor = recv_activation() [default3]:Exception raised from recvBytes at ../torch/csrc/distributed/c10d/Utils.hpp:670 (most recent call first): [default4]: return self._call_impl(*args, **kwargs) [default1]:frame #57: + 0x150582 (0x5638286ec582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:frame #62: + 0x150582 (0x55bd4bca8582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]: recv_activation_tensor = recv_activation() [default6]: pipeline_state.run_communication() [default4]:frame #7: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f4d8c4e7f81 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default2]: outputs = self.pipeline_engine.train_batch_iter( 
[default3]:Traceback (most recent call last): [default5]:frame #52: _PyObject_FastCallDictTstate + 0x187 (0x5648512e5007 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f2b284ded87 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default6]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 780, in forward_with_hidden_states [default1]:frame #58: _PyEval_EvalFrameDefault + 0x13ca (0x5638286d18fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:frame #58: _PyEval_EvalFrameDefault + 0x13ca (0x55be000cf8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 31, in __call__ [default3]: pipeline_state.run_communication() [default2]: return self._call_impl(*args, **kwargs) [default3]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 237, in [default2]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 187, in train_batch_iter [default2]: output = self.forward(context=context, state=state, micro_batch=micro_batch, model=model) [default7]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 780, in forward_with_hidden_states [default2]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1706, in recv [default7]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 459, in training_step [default2]:frame #55: PyObject_Call + 0x207 (0x55c235b5b067 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:frame #63: 
PyObject_Call + 0xbc (0x55bd4bca8f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:. This may indicate a possible application crash on rank 0 or a network set up issue. [default3]: return self.p2p.recv_tensors(num_tensors=1, from_rank=self.from_rank)[0] [default6]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 150, in run_communication [default4]:frame #8: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f4d8c4e7f81 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default3]: trainer.train(dataloader) [default6]:frame #31: _PyEval_EvalFrameDefault + 0x13ca (0x5619ab3888fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:frame #53: _PyObject_Call_Prepend + 0x69 (0x5648512f6c39 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]: new_kwargs[name] = recv_from_pipeline_state_buffer( [default3]:frame #1: + 0x589518e (0x7f2b6049818e in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default2]: return self._call_impl(*args, **kwargs) [default4]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl [default5]: return forward_call(*args, **kwargs) [default2]:frame #59: + 0x150582 (0x55be000ea582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:frame #60: PyObject_Call + 0xbc (0x55be000eaf1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:frame #61: _PyEval_EvalFrameDefault + 0x2d83 (0x55be000d12b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 31, in __call__ [default1]:frame #9: 
c10d::ProcessGroupNCCL::broadcastUniqueNCCLID(ncclUniqueId*, bool, std::string const&, int) + 0xa9 (0x7fe8ef663c69 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default3]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 430, in train [default7]: hidden_encoder_states = encoder_block(**hidden_encoder_states) [default0]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/functional.py", line 117, in recv_from_pipeline_state_buffer [default4]: return forward_call(*args, **kwargs) [default3]:frame #21: + 0x1445a6 (0x55a04ba9c5a6 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]: outputs = self.pipeline_engine.train_batch_iter( [default1]:frame #59: + 0x150582 (0x5638286ec582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:frame #60: PyObject_Call + 0xbc (0x5638286ecf1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:frame #62: + 0x150582 (0x55be000ea582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:frame #63: PyObject_Call + 0xbc (0x55be000eaf1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:. This may indicate a possible application crash on rank 0 or a network set up issue. 
[default2]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 326, in irecv_tensors [default2]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl [default7]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 459, in training_step [default2]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 44, in forward [default6]:frame #32: + 0x150582 (0x5619ab3a3582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]: pg.recv([tensor], group_src_rank, tag).wait() [default5]: pipeline_state.run_communication() [default5]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 150, in run_communication [default1]:frame #61: _PyEval_EvalFrameDefault + 0x2d83 (0x5638286d32b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:frame #62: + 0x150582 (0x5638286ec582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:Traceback (most recent call last): [default3]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 237, in [default3]: trainer.train(dataloader) [default3]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 430, in train [default3]: outputs, loss_avg = self.training_step(dataloader=self.current_dataloader) [default3]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 459, in training_step [default3]: outputs = self.pipeline_engine.train_batch_iter( [default3]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 187, in train_batch_iter [default3]: output = self.forward(context=context, 
state=state, micro_batch=micro_batch, model=model) [default3]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 44, in forward [default3]: output = [default2]: meta = self._recv_meta(from_rank=from_rank, tag=tag) [default3]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 353, in recv_tensors [default6]: recv_activation_tensor = recv_activation() [default3]: outputs, loss_avg = self.training_step(dataloader=self.current_dataloader) [default2]: output = model(**micro_batch) [default5]:frame #54: + 0x211239 (0x5648513b9239 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:frame #55: PyObject_Call + 0x207 (0x5648512f9067 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 31, in __call__ [default4]: return self.p2p.recv_tensors(num_tensors=1, from_rank=self.from_rank)[0] [default5]: recv_activation_tensor = recv_activation() [default5]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 764, in forward [default5]: return self.forward_with_hidden_states(input_ids=input_ids, input_mask=input_mask)[0] model(**micro_batch) [default3]: buffers, futures = self.irecv_tensors(num_tensors=num_tensors, from_rank=from_rank, tag=tag) [default6]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 31, in __call__ [default7]: outputs = self.pipeline_engine.train_batch_iter( [default6]:frame #33: _PyEval_EvalFrameDefault + 0x13ca (0x5619ab3888fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:frame #34: + 0x150582 (0x5619ab3a3582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:frame #2: 
c10d::TCPStore::doWait(c10::ArrayRef, std::chrono::duration >) + 0x360 (0x7f2b604929a0 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default4]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 890, in forward [default1]:frame #63: PyObject_Call + 0xbc (0x5638286ecf1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:. This may indicate a possible application crash on rank 0 or a network set up issue. [default3]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl [default3]: return self._call_impl(*args, **kwargs) [default3]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl [default3]: return forward_call(*args, **kwargs) [default3]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 890, in forward [default3]: sharded_logits = self.model( [default3]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl [default3]: return self._call_impl(*args, **kwargs) [default3]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl [default3]: return forward_call(*args, **kwargs) [default3[default2]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 246, in _recv_meta [default3]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 150, in run_communication [default0]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 326, in irecv_tensors 
[default0]: meta = self._recv_meta(from_rank=from_rank, tag=tag) [default5]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl [default1]:frame #10: c10d::ProcessGroupNCCL::getNCCLComm(std::string const&, std::vector > const&, c10d::OpType, int, bool) + 0x22b (0x7fe8ef66ac5b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default2]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl [default2]: return self._call_impl(*args, **kwargs) [default5]:frame #56: _PyEval_EvalFrameDefault + 0x2d83 (0x5648512df2b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:frame #3: c10d::TCPStore::doGet(std::string const&) + 0x32 (0x7f2b60492ce2 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default7]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 187, in train_batch_iter [default5]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 780, in forward_with_hidden_states ]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 764, in forward [default3]: return self.forward_with_hidden_states(input_ids=input_ids, input_mask=input_mask)[0] [default3]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 780, in forward_with_hidden_states [default3]: hidden_encoder_states = encoder_block(**hidden_encoder_states) [default3]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl [default3]: return self._call_impl(*args, **kwargs) [default3]: File 
"/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl [default3]: return forward_call(*args, **kwargs) [default0]: return self.p2p.recv_tensors(num_tensors=1, from_rank=self.from_rank)[0] [default2]: return forward_call(*args, **kwargs) [default7]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 187, in train_batch_iter [default5]:frame #57: + 0x150582 (0x5648512f8582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:frame #35: _PyEval_EvalFrameDefault + 0x13ca (0x5619ab3888fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/functional.py", line 117, in recv_from_pipeline_state_buffer [default4]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 353, in recv_tensors [default0]: return self._call_impl(*args, **kwargs) [default2]:frame #56: _PyEval_EvalFrameDefault + 0x2d83 (0x55c235b412b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/block.py", line 126, in forward [default3]: new_kwargs[name] = recv_from_pipeline_state_buffer( [default3]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/functional.py", line 117, in recv_from_pipeline_state_buffer [default3]: pipeline_state.run_communication() [default3]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 150, in run_communication [default3]: recv_activation_tensor = recv_activation() [default3]: File 
"/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 31, in __call__ [default3]: return self.p2p.recv_tensors(num_tensors=1, from_rank=self.from_rank)[0] [default3]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 353, in recv_te[default5]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/block.py", line 126, in forward [default0]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 246, in _recv_meta [default2]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl [default7]: output = self.forward(context=context, state=state, micro_batch=micro_batch, model=model) [default5]:frame #58: _PyEval_EvalFrameDefault + 0x13ca (0x5648512dd8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:frame #59: + 0x150582 (0x5648512f8582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl [default6]:frame #36: _PyObject_FastCallDictTstate + 0xd0 (0x5619ab38ff50 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]: return self._call_impl(*args, **kwargs) [default5]:frame #60: PyObject_Call + 0xbc (0x5648512f8f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]: buffers, futures = self.irecv_tensors(num_tensors=num_tensors, from_rank=from_rank, tag=tag) [default2]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl [default2]:frame #57: + 0x150582 (0x55c235b5a582 in 
/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:frame #58: _PyEval_EvalFrameDefault + 0x13ca (0x55c235b3f8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) nsors [default3]: buffers, futures = self.irecv_tensors(num_tensors=num_tensors, from_rank=from_rank, tag=tag) [default3]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 326, in irecv_tensors [default3]: meta = self._recv_meta(from_rank=from_rank, tag=tag) [default3]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 246, in _recv_meta [default2]: dist.recv( [default3]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 326, in irecv_tensors [default3]: recv_activation_tensor = recv_activation() [default2]: return forward_call(*args, **kwargs) [default7]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl [default2]:torch.distributed.DistBackendError: [4] is setting up NCCL communicator and retrieving ncclUniqueId from [0] via c10d key-value store by key '3:4', but store->get('3:4') got error: Connection reset by peer [default0]: pipeline_state.run_communication() [default6]: hidden_encoder_states = encoder_block(**hidden_encoder_states) [default4]:Traceback (most recent call last): [default4]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 237, in [default5]: hidden_encoder_states = encoder_block(**hidden_encoder_states) [default5]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl [default3]: dist.recv( [default3]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 72, in 
wrapper [default3]: return func(*args, **kwargs) [default3]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1706, in recv [default3]: pg.recv([tensor], group_src_rank, tag).wait() [default5]: new_kwargs[name] = recv_from_pipeline_state_buffer( [default6]: return self.p2p.recv_tensors(num_tensors=1, from_rank=self.from_rank)[0] [default6]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 353, in recv_tensors [default3]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 459, in training_step [default5]:frame #61: _PyEval_EvalFrameDefault + 0x2d83 (0x5648512df2b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 326, in irecv_tensors [default6]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl [default7]: output = self.forward(context=context, state=state, micro_batch=micro_batch, model=model) [default2]:frame #59: + 0x150582 (0x55c235b5a582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]: trainer.train(dataloader) [default2]:frame #60: PyObject_Call + 0xbc (0x55c235b5af1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:torch.distributed.DistBackendError: [2] is setting up NCCL communicator and retrieving ncclUniqueId from [0] via c10d key-value store by key '1:2', but store->get('1:2') got error: Connection reset by peer [default3]:Exception raised from recvBytes at ../torch/csrc/distributed/c10d/Utils.hpp:670 (most recent call first): [default2]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", 
line 72, in wrapper [default1]:frame #11: c10d::ProcessGroupNCCL::recv(std::vector >&, int, int) + 0x550 (0x7fe8ef68db60 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default7]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 44, in forward [default7]: output = model(**micro_batch) [default7]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl [default7]: return forward_call(*args, **kwargs) [default7]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/block.py", line 126, in forward [default3]:frame #4: c10d::TCPStore::get(std::string const&) + 0xa1 (0x7f2b60493b11 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default7]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 44, in forward [default5]: return self._call_impl(*args, **kwargs) [default2]:frame #61: _PyEval_EvalFrameDefault + 0x2d83 (0x55c235b412b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f9ecb875d87 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default3]:frame #1: + 0x589518e (0x7f9f0382f18e in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default0]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 353, in recv_tensors [default5]: return forward_call(*args, **kwargs) [default5]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 764, in forward [default7]: 
return self._call_impl(*args, **kwargs) [default5]:frame #62: + 0x150582 (0x5648512f8582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:frame #63: PyObject_Call + 0xbc (0x5648512f8f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]: pipeline_state.run_communication() [default4]: meta = self._recv_meta(from_rank=from_rank, tag=tag) [default2]: return forward_call(*args, **kwargs) [default5]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl [default5]: return forward_call(*args, **kwargs) [default3]:frame #2: c10d::TCPStore::doWait(c10::ArrayRef, std::chrono::duration >) + 0x360 (0x7f9f038299a0 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default3]:frame #3: c10d::TCPStore::doGet(std::string const&) + 0x32 (0x7f9f03829ce2 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default0]: buffers, futures = self.irecv_tensors(num_tensors=num_tensors, from_rank=from_rank, tag=tag) [default3]: meta = self._recv_meta(from_rank=from_rank, tag=tag) [default5]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/functional.py", line 117, in recv_from_pipeline_state_buffer [default3]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 31, in __call__ [default7]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl [default2]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 890, in forward [default7]: new_kwargs[name] = recv_from_pipeline_state_buffer( [default4]: File 
"/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 246, in _recv_meta [default2]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 890, in forward [default4]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 430, in train [default3]:frame #4: c10d::TCPStore::get(std::string const&) + 0xa1 (0x7f9f0382ab11 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default3]:frame #5: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f9f037dff81 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default3]:frame #6: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f9f037dff81 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default0]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 326, in irecv_tensors [default1]:frame #12: + 0x5838439 (0x7fe926418439 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default7]: return forward_call(*args, **kwargs) [default6]:frame #37: _PyObject_Call_Prepend + 0x69 (0x5619ab3a1c39 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:frame #38: + 0x211239 (0x5619ab464239 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 150, in run_communication [default3]:frame #5: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f2b60448f81 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default5]: File 
"/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 31, in __call__ [default5]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/block.py", line 126, in forward [default3]:frame #7: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f9f037dff81 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default3]:frame #8: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f9f037dff81 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default5]: pipeline_state.run_communication() [default1]:frame #13: + 0x5843330 (0x7fe926423330 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default6]: buffers, futures = self.irecv_tensors(num_tensors=num_tensors, from_rank=from_rank, tag=tag) [default6]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 326, in irecv_tensors [default3]: outputs = self.pipeline_engine.train_batch_iter( [default7]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/functional.py", line 117, in recv_from_pipeline_state_buffer [default7]: pipeline_state.run_communication() [default3]:frame #6: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f2b60448f81 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default3]:frame #7: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f2b60448f81 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default7]: output = model(**micro_batch) [default4]: outputs, loss_avg = self.training_step(dataloader=self.current_dataloader) [default4]: File 
"/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 459, in training_step [default3]:frame #9: c10d::ProcessGroupNCCL::broadcastUniqueNCCLID(ncclUniqueId*, bool, std::string const&, int) + 0xa9 (0x7f9ecca1dc69 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default0]: meta = self._recv_meta(from_rank=from_rank, tag=tag) [default2]: return func(*args, **kwargs) [default0]: dist.recv( [default7]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 890, in forward [default5]:. This may indicate a possible application crash on rank 0 or a network set up issue. [default7]: recv_activation_tensor = recv_activation() [default7]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 31, in __call__ [default0]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl [default0]: return forward_call(*args, **kwargs) [default2]:frame #62: + 0x150582 (0x55c235b5a582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:frame #63: PyObject_Call + 0xbc (0x55c235b5af1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:frame #10: c10d::ProcessGroupNCCL::getNCCLComm(std::string const&, std::vector > const&, c10d::OpType, int, bool) + 0x22b (0x7f9ecca24c5b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default3]:frame #11: c10d::ProcessGroupNCCL::recv(std::vector >&, int, int) + 0x550 (0x7f9ecca47b60 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default3]:frame #12: + 0x5838439 (0x7f9f037d2439 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 
[default5]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 150, in run_communication [default5]: return self.forward_with_hidden_states(input_ids=input_ids, input_mask=input_mask)[0] [default2]: sharded_logits = self.model( [default3]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 187, in train_batch_iter [default7]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 150, in run_communication [default2]:Exception raised from recvBytes at ../torch/csrc/distributed/c10d/Utils.hpp:670 (most recent call first): [default3]:frame #22: _PyObject_MakeTpCall + 0x26b (0x55a04ba95a6b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:frame #23: + 0x150866 (0x55a04baa8866 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]: outputs = self.pipeline_engine.train_batch_iter( [default3]:frame #13: + 0x5843330 (0x7f9f037dd330 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default3]:frame #14: + 0x58433c5 (0x7f9f037dd3c5 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default3]:frame #15: + 0x4e893cc (0x7f9f02e233cc in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default3]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 246, in _recv_meta [default2]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 764, in forward [default2]: return self.forward_with_hidden_states(input_ids=input_ids, input_mask=input_mask)[0] [default7]: sharded_logits = self.model( [default7]: recv_activation_tensor = recv_activation() [default2]:frame 
#0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f5179cb8d87 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default6]: return self._call_impl(*args, **kwargs) [default4]: sharded_logits = self.model( [default5]: new_kwargs[name] = recv_from_pipeline_state_buffer( [default3]:frame #16: + 0x1a08a88 (0x7f9eff9a2a88 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default3]:frame #17: + 0x5849a84 (0x7f9f037e3a84 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default0]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 246, in _recv_meta [default1]:frame #14: + 0x58433c5 (0x7fe9264233c5 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default7]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl [default3]: output = self.forward(context=context, state=state, micro_batch=micro_batch, model=model) [default7]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 31, in __call__ [default2]:frame #1: + 0x589518e (0x7f51b1c7218e in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default7]: return self.p2p.recv_tensors(num_tensors=1, from_rank=self.from_rank)[0] [default0]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 890, in forward [default4]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 187, in train_batch_iter [default3]:frame #18: + 0x584ed35 (0x7f9f037e8d35 in 
/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default3]:frame #19: + 0xc97eee (0x7f9f1609aeee in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_python.so) [default3]:frame #20: + 0x413ea4 (0x7f9f15816ea4 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_python.so) [default3]:frame #21: + 0x1445a6 (0x55eb169875a6 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:frame #22: _PyObject_MakeTpCall + 0x26b (0x55eb16980a6b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]: recv_activation_tensor = recv_activation() [default5]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 780, in forward_with_hidden_states [default3]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 44, in forward [default6]:frame #39: _PyObject_MakeTpCall + 0x26b (0x5619ab390a6b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:frame #8: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f2b60448f81 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default0]: sharded_logits = self.model( [default3]:frame #24: _PyEval_EvalFrameDefault + 0x4c12 (0x55a04ba91142 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl [default2]:. This may indicate a possible application crash on rank 0 or a network set up issue. 
[default3]:frame #23: + 0x150866 (0x55eb16993866 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:frame #24: _PyEval_EvalFrameDefault + 0x4c12 (0x55eb1697c142 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:frame #25: _PyFunction_Vectorcall + 0x6c (0x55eb16987a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1706, in recv [default6]: meta = self._recv_meta(from_rank=from_rank, tag=tag) [default3]: output = model(**micro_batch) [default7]: return self.p2p.recv_tensors(num_tensors=1, from_rank=self.from_rank)[0] [default2]:frame #2: c10d::TCPStore::doWait(c10::ArrayRef, std::chrono::duration >) + 0x360 (0x7f51b1c6c9a0 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default5]: return self.p2p.recv_tensors(num_tensors=1, from_rank=self.from_rank)[0] [default5]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 353, in recv_tensors [default5]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/functional.py", line 117, in recv_from_pipeline_state_buffer [default3]:frame #26: PyObject_Call + 0xbc (0x55eb16993f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:frame #27: _PyEval_EvalFrameDefault + 0x2d83 (0x55eb1697a2b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:frame #28: _PyFunction_Vectorcall + 0x6c (0x55eb16987a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:frame #29: _PyEval_EvalFrameDefault + 0x13ca (0x55eb169788fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]: dist.recv( [default3]: dist.recv( [default2]: 
pg.recv([tensor], group_src_rank, tag).wait() [default2]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 780, in forward_with_hidden_states [default3]: return self.p2p.recv_tensors(num_tensors=1, from_rank=self.from_rank)[0] [default2]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl [default6]:frame #40: _PyEval_EvalFrameDefault + 0x4eb6 (0x5619ab38c3e6 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 353, in recv_tensors [default2]: sharded_logits = self.model( [default2]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl [default5]: pipeline_state.run_communication() [default3]:frame #30: + 0x150582 (0x55eb16993582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:frame #31: _PyEval_EvalFrameDefault + 0x13ca (0x55eb169788fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:frame #32: + 0x150582 (0x55eb16993582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:frame #33: _PyEval_EvalFrameDefault + 0x13ca (0x55eb169788fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 72, in wrapper [default4]:frame #9: c10d::ProcessGroupNCCL::broadcastUniqueNCCLID(ncclUniqueId*, bool, std::string const&, int) + 0xa9 (0x7f4d55725c69 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default3]: File 
"/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl [default7]: return self._call_impl(*args, **kwargs) [default2]: return self._call_impl(*args, **kwargs) [default7]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 353, in recv_tensors [default7]: buffers, futures = self.irecv_tensors(num_tensors=num_tensors, from_rank=from_rank, tag=tag) [default4]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl [default5]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 150, in run_communication [default3]:frame #34: + 0x150582 (0x55eb16993582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:frame #35: _PyEval_EvalFrameDefault + 0x13ca (0x55eb169788fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:frame #36: _PyObject_FastCallDictTstate + 0xd0 (0x55eb1697ff50 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:frame #37: _PyObject_Call_Prepend + 0x69 (0x55eb16991c39 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 72, in wrapper [default5]: hidden_encoder_states = encoder_block(**hidden_encoder_states) [default1]:frame #15: + 0x4e893cc (0x7fe925a693cc in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default7]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl [default6]:frame #41: _PyFunction_Vectorcall + 0x6c (0x5619ab397a2c in 
/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:frame #42: _PyEval_EvalFrameDefault + 0x72c (0x5619ab387c5c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:frame #9: c10d::ProcessGroupNCCL::broadcastUniqueNCCLID(ncclUniqueId*, bool, std::string const&, int) + 0xa9 (0x7f2b29686c69 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default0]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl [default5]: recv_activation_tensor = recv_activation() [default5]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 31, in __call__ [default3]:frame #38: + 0x211239 (0x55eb16a54239 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:frame #39: _PyObject_MakeTpCall + 0x26b (0x55eb16980a6b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:frame #40: _PyEval_EvalFrameDefault + 0x4eb6 (0x55eb1697c3e6 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:torch.distributed.DistBackendError: [10] is setting up NCCL communicator and retrieving ncclUniqueId from [0] via c10d key-value store by key '9:10', but store->get('9:10') got error: Connection reset by peer [default0]: return func(*args, **kwargs) [default0]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 72, in wrapper [default2]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl [default6]:frame #43: _PyFunction_Vectorcall + 0x6c (0x5619ab397a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:frame #44: _PyEval_EvalFrameDefault + 0x13ca (0x5619ab3888fa 
in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:frame #10: c10d::ProcessGroupNCCL::getNCCLComm(std::string const&, std::vector > const&, c10d::OpType, int, bool) + 0x22b (0x7f2b2968dc5b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default7]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 326, in irecv_tensors [default7]: return self._call_impl(*args, **kwargs) [default4]: output = self.forward(context=context, state=state, micro_batch=micro_batch, model=model) [default4]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 44, in forward [default3]:frame #41: _PyFunction_Vectorcall + 0x6c (0x55eb16987a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:frame #42: _PyEval_EvalFrameDefault + 0x72c (0x55eb16977c5c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:frame #43: _PyFunction_Vectorcall + 0x6c (0x55eb16987a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:frame #44: _PyEval_EvalFrameDefault + 0x13ca (0x55eb169788fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 31, in __call__ [default6]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 246, in _recv_meta [default7]: return forward_call(*args, **kwargs) [default7]: buffers, futures = self.irecv_tensors(num_tensors=num_tensors, from_rank=from_rank, tag=tag) [default7]: meta = self._recv_meta(from_rank=from_rank, tag=tag) [default7]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in 
_call_impl [default4]: output = model(**micro_batch) [default3]:frame #45: + 0x150582 (0x55eb16993582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:frame #46: PyObject_Call + 0xbc (0x55eb16993f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:frame #47: _PyEval_EvalFrameDefault + 0x2d83 (0x55eb1697a2b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:frame #48: + 0x150582 (0x55eb16993582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]: return func(*args, **kwargs) [default2]: hidden_encoder_states = encoder_block(**hidden_encoder_states) [default1]:frame #16: + 0x1a08a88 (0x7fe9225e8a88 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default3]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 353, in recv_tensors [default3]: buffers, futures = self.irecv_tensors(num_tensors=num_tensors, from_rank=from_rank, tag=tag) [default3]: return self._call_impl(*args, **kwargs) [default2]: return forward_call(*args, **kwargs) [default7]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 326, in irecv_tensors [default4]: dist.recv( [default4]: return self._call_impl(*args, **kwargs) [default4]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl [default3]:frame #49: PyObject_Call + 0xbc (0x55eb16993f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:frame #50: _PyEval_EvalFrameDefault + 0x2d83 (0x55eb1697a2b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:frame #51: _PyFunction_Vectorcall + 0x6c (0x55eb16987a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) 
[default3]:frame #52: _PyObject_FastCallDictTstate + 0x187 (0x55eb16980007 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1706, in recv [default3]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 326, in irecv_tensors [default6]: dist.recv( [default6]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 72, in wrapper [default7]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 764, in forward [default6]:frame #45: + 0x150582 (0x5619ab3a3582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 72, in wrapper [default0]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 150, in run_communication [default4]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl [default2]: return self._call_impl(*args, **kwargs) [default6]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl [default5]: return self.p2p.recv_tensors(num_tensors=1, from_rank=self.from_rank)[0] [default4]: return self._call_impl(*args, **kwargs) [default3]:frame #53: _PyObject_Call_Prepend + 0x69 (0x55eb16991c39 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:frame #54: + 0x211239 (0x55eb16a54239 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:frame #55: PyObject_Call + 0x207 (0x55eb16994067 in 
/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]: pg.recv([tensor], group_src_rank, tag).wait() [default3]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1706, in recv [default5]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl [default3]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl [default2]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 764, in forward [default7]: meta = self._recv_meta(from_rank=from_rank, tag=tag) [default0]: recv_activation_tensor = recv_activation() [default2]:frame #3: c10d::TCPStore::doGet(std::string const&) + 0x32 (0x7f51b1c6cce2 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default7]: return forward_call(*args, **kwargs) [default5]: buffers, futures = self.irecv_tensors(num_tensors=num_tensors, from_rank=from_rank, tag=tag) [default0]: return self._call_impl(*args, **kwargs) [default5]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 353, in recv_tensors [default3]:frame #56: _PyEval_EvalFrameDefault + 0x2d83 (0x55eb1697a2b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:frame #57: + 0x150582 (0x55eb16993582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:frame #58: _PyEval_EvalFrameDefault + 0x13ca (0x55eb169788fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:frame #59: + 0x150582 (0x55eb16993582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]: pg.recv([tensor], group_src_rank, tag).wait() [default2]: File 
"/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl [default7]: return self.forward_with_hidden_states(input_ids=input_ids, input_mask=input_mask)[0] [default6]:frame #46: PyObject_Call + 0xbc (0x5619ab3a3f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 246, in _recv_meta [default7]: dist.recv( [default1]:frame #32: + 0x150582 (0x5577f18ec582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl [default3]:frame #60: PyObject_Call + 0xbc (0x55eb16993f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:frame #61: _PyEval_EvalFrameDefault + 0x2d83 (0x55eb1697a2b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:frame #62: + 0x150582 (0x55eb16993582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:frame #63: PyObject_Call + 0xbc (0x55eb16993f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:torch.distributed.DistBackendError: [10] is setting up NCCL communicator and retrieving ncclUniqueId from [0] via c10d key-value store by key '9:10', but store->get('9:10') got error: Connection reset by peer [default4]:frame #10: c10d::ProcessGroupNCCL::getNCCLComm(std::string const&, std::vector > const&, c10d::OpType, int, bool) + 0x22b (0x7f4d5572cc5b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default2]: return self.forward_with_hidden_states(input_ids=input_ids, input_mask=input_mask)[0] [default7]: File 
"/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 246, in _recv_meta [default3]:frame #11: c10d::ProcessGroupNCCL::recv(std::vector >&, int, int) + 0x550 (0x7f2b296b0b60 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default2]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl [default5]: buffers, futures = self.irecv_tensors(num_tensors=num_tensors, from_rank=from_rank, tag=tag) [default3]:. This may indicate a possible application crash on rank 0 or a network set up issue. [default5]: return self.p2p.recv_tensors(num_tensors=1, from_rank=self.from_rank)[0] [default4]:frame #11: c10d::ProcessGroupNCCL::recv(std::vector >&, int, int) + 0x550 (0x7f4d5574fb60 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default3]: return forward_call(*args, **kwargs) [default7]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 786, in forward_with_hidden_states [default6]:frame #47: _PyEval_EvalFrameDefault + 0x2d83 (0x5619ab38a2b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 31, in __call__ [default3]:frame #25: _PyFunction_Vectorcall + 0x6c (0x55a04ba9ca2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]: return forward_call(*args, **kwargs) [default5]:Traceback (most recent call last): [default3]:torch.distributed.DistBackendError: [10] is setting up NCCL communicator and retrieving ncclUniqueId from [0] via c10d key-value store by key '9:10', but store->get('9:10') got error: Connection reset by peer [default3]:Exception raised from recvBytes at 
../torch/csrc/distributed/c10d/Utils.hpp:670 (most recent call first): [default5]: return self._call_impl(*args, **kwargs) [default0]: return func(*args, **kwargs) [default2]: return self._call_impl(*args, **kwargs) [default6]: return func(*args, **kwargs) [default2]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 782, in forward_with_hidden_states [default7]: dist.recv( [default4]: return func(*args, **kwargs) [default4]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1706, in recv [default4]: return forward_call(*args, **kwargs) [default5]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 326, in irecv_tensors [default5]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 237, in [default5]: trainer.train(dataloader) [default5]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 430, in train [default5]: outputs, loss_avg = self.training_step(dataloader=self.current_dataloader) [default5]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 459, in training_step [default5]: outputs = self.pipeline_engine.train_batch_iter( [default5]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 187, in train_batch_iter [default2]:Exception raised from recvBytes at ../torch/csrc/distributed/c10d/Utils.hpp:670 (most recent call first): [default4]:frame #12: + 0x5838439 (0x7f4d8c4da439 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default4]:frame #13: + 0x5843330 (0x7f4d8c4e5330 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default7]: fp32_sharded_logits = 
self.cast_to_fp32(x=sharded_logits)["output"] [default6]:frame #48: + 0x150582 (0x5619ab3a3582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 72, in wrapper [default6]: return forward_call(*args, **kwargs) [default4]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 890, in forward [default5]: output = self.forward(context=context, state=state, micro_batch=micro_batch, model=model) [default0]:Exception raised from recvBytes at ../torch/csrc/distributed/c10d/Utils.hpp:670 (most recent call first): [default0]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1706, in recv [default0]: pg.recv([tensor], group_src_rank, tag).wait() [default2]: hidden_states = self.final_layer_norm(input=hidden_encoder_states["hidden_states"])["hidden_states"] [default3]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 890, in forward [default6]:frame #49: PyObject_Call + 0xbc (0x5619ab3a3f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]: return func(*args, **kwargs) [default6]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/block.py", line 126, in forward [default5]: meta = self._recv_meta(from_rank=from_rank, tag=tag) [default5]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 246, in _recv_meta [default5]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 44, in forward [default3]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7fc8a69d8d87 in 
/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default5]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl [default4]:frame #14: + 0x58433c5 (0x7f4d8c4e53c5 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default4]:frame #15: + 0x4e893cc (0x7f4d8bb2b3cc in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default3]: sharded_logits = self.model( [default6]:frame #50: _PyEval_EvalFrameDefault + 0x2d83 (0x5619ab38a2b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]: return self.p2p.recv_tensors(num_tensors=1, from_rank=self.from_rank)[0] [default0]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl [default5]: dist.recv( [default5]: output = model(**micro_batch) [default2]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f4b0c83ad87 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default2]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl [default6]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1706, in recv [default5]: return forward_call(*args, **kwargs) [default7]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl [default7]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 72, in wrapper [default0]: File 
"/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 353, in recv_tensors [default0]: return forward_call(*args, **kwargs) [default4]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 764, in forward [default7]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 890, in forward [default5]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 72, in wrapper [default0]:Traceback (most recent call last): [default0]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f21c0922d87 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default5]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 353, in recv_tensors [default5]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/block.py", line 126, in forward [default0]:torch.distributed.DistBackendError: [6] is setting up NCCL communicator and retrieving ncclUniqueId from [0] via c10d key-value store by key '5:6', but store->get('5:6') got error: Connection reset by peer [default4]:frame #16: + 0x1a08a88 (0x7f4d886aaa88 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default3]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl [default2]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl [default6]:frame #51: _PyFunction_Vectorcall + 0x6c (0x5619ab397a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:frame #12: + 0x5838439 
(0x7f2b6043b439 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default5]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 326, in irecv_tensors [default5]: return func(*args, **kwargs) [default0]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 237, in [default2]:frame #1: + 0x589518e (0x7f4b447f418e in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default4]:frame #17: + 0x5849a84 (0x7f4d8c4eba84 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default7]: return self._call_impl(*args, **kwargs) [default7]: return func(*args, **kwargs) [default3]:frame #13: + 0x5843330 (0x7f2b60446330 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default5]: meta = self._recv_meta(from_rank=from_rank, tag=tag) [default5]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1706, in recv [default5]: pg.recv([tensor], group_src_rank, tag).wait() [default5]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl [default3]:frame #1: + 0x589518e (0x7fc8de99218e in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default1]:frame #17: + 0x5849a84 (0x7fe926429a84 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default3]: return self._call_impl(*args, **kwargs) [default2]: return self._call_impl(*args, **kwargs) [default6]:frame #52: _PyObject_FastCallDictTstate + 0x187 (0x5619ab390007 in 
/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:frame #4: c10d::TCPStore::get(std::string const&) + 0xa1 (0x7f51b1c6db11 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default3]:frame #26: PyObject_Call + 0xbc (0x55a04baa8f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]: sharded_logits = self.model( [default4]:Traceback (most recent call last): [default0]:frame #1: + 0x589518e (0x7f21f88dc18e in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default0]:Exception raised from recvBytes at ../torch/csrc/distributed/c10d/Utils.hpp:670 (most recent call first): [default3]: meta = self._recv_meta(from_rank=from_rank, tag=tag) [default7]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl [default7]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1706, in recv [default7]: pg.recv([tensor], group_src_rank, tag).wait() [default0]: buffers, futures = self.irecv_tensors(num_tensors=num_tensors, from_rank=from_rank, tag=tag) [default7]: sharded_logits = self.model( [default4]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl [default4]: return self._call_impl(*args, **kwargs) [default4]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 237, in [default0]: trainer.train(dataloader) [default3]:frame #2: c10d::TCPStore::doWait(c10::ArrayRef, std::chrono::duration >) + 0x360 (0x7fc8de98c9a0 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default6]: pg.recv([tensor], group_src_rank, tag).wait() 
[default6]:torch.distributed.DistBackendError: [7] is setting up NCCL communicator and retrieving ncclUniqueId from [0] via c10d key-value store by key '6:7', but store->get('6:7') got error: Connection reset by peer [default2]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl [default6]:frame #53: _PyObject_Call_Prepend + 0x69 (0x5619ab3a1c39 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1706, in recv [default6]: new_kwargs[name] = recv_from_pipeline_state_buffer( [default6]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/functional.py", line 117, in recv_from_pipeline_state_buffer [default4]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl [default0]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 430, in train [default2]:frame #2: c10d::TCPStore::doWait(c10::ArrayRef, std::chrono::duration >) + 0x360 (0x7f4b447ee9a0 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default5]: new_kwargs[name] = recv_from_pipeline_state_buffer( [default3]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl [default7]: return forward_call(*args, **kwargs) [default7]:torch.distributed.DistBackendError: [1] is setting up NCCL communicator and retrieving ncclUniqueId from [0] via c10d key-value store by key '0:1', but store->get('0:1') got error: Connection reset by peer [default7]: pg.recv([tensor], group_src_rank, tag).wait() [default5]: File 
"/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 246, in _recv_meta [default4]: return forward_call(*args, **kwargs) [default4]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 764, in forward [default5]: return self._call_impl(*args, **kwargs) [default0]:frame #2: c10d::TCPStore::doWait(c10::ArrayRef, std::chrono::duration >) + 0x360 (0x7f21f88d69a0 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default0]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f9187f90d87 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default3]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 246, in _recv_meta [default1]:frame #18: + 0x584ed35 (0x7fe92642ed35 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default1]:frame #19: + 0xc97eee (0x7fe938ce0eee in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_python.so) [default2]: return forward_call(*args, **kwargs) [default6]:frame #54: + 0x211239 (0x5619ab464239 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:frame #14: + 0x58433c5 (0x7f2b604463c5 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default2]:frame #5: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f51b1c22f81 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default4]: pg.recv([tensor], group_src_rank, tag).wait() [default0]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 326, in irecv_tensors [default7]: File 
"/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl [default2]: return forward_call(*args, **kwargs) [default4]: return self.forward_with_hidden_states(input_ids=input_ids, input_mask=input_mask)[0] [default0]: outputs, loss_avg = self.training_step(dataloader=self.current_dataloader) [default3]:frame #3: c10d::TCPStore::doGet(std::string const&) + 0x32 (0x7fc8de98cce2 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default5]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/functional.py", line 117, in recv_from_pipeline_state_buffer [default7]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/block.py", line 126, in forward [default7]: new_kwargs[name] = recv_from_pipeline_state_buffer( [default7]:Exception raised from recvBytes at ../torch/csrc/distributed/c10d/Utils.hpp:670 (most recent call first): [default0]: meta = self._recv_meta(from_rank=from_rank, tag=tag) [default2]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 764, in forward [default4]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 780, in forward_with_hidden_states [default5]:torch.distributed.DistBackendError: [9] is setting up NCCL communicator and retrieving ncclUniqueId from [0] via c10d key-value store by key '8:9', but store->get('8:9') got error: Connection reset by peer [default5]:Exception raised from recvBytes at ../torch/csrc/distributed/c10d/Utils.hpp:670 (most recent call first): [default5]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl [default2]:frame #3: c10d::TCPStore::doGet(std::string const&) + 0x32 (0x7f4b447eece2 in 
/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default3]: dist.recv( [default2]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/block.py", line 126, in forward [default6]:frame #55: PyObject_Call + 0x207 (0x5619ab3a4067 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:torch.distributed.DistBackendError: [5] is setting up NCCL communicator and retrieving ncclUniqueId from [0] via c10d key-value store by key '4:5', but store->get('4:5') got error: Connection reset by peer [default3]:frame #27: _PyEval_EvalFrameDefault + 0x2d83 (0x55a04ba8f2b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f9355ec7d87 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default0]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 459, in training_step [default0]:frame #3: c10d::TCPStore::doGet(std::string const&) + 0x32 (0x7f21f88d6ce2 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default5]: buffers, futures = self.irecv_tensors(num_tensors=num_tensors, from_rank=from_rank, tag=tag) [default5]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 326, in irecv_tensors [default1]:frame #20: + 0x413ea4 (0x7fe93845cea4 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_python.so) [default7]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/functional.py", line 117, in recv_from_pipeline_state_buffer [default7]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f67227f4d87 in 
/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default7]:Exception raised from recvBytes at ../torch/csrc/distributed/c10d/Utils.hpp:670 (most recent call first): [default0]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 246, in _recv_meta [default4]: return self.forward_with_hidden_states(input_ids=input_ids, input_mask=input_mask)[0] [default4]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 780, in forward_with_hidden_states [default5]:frame #1: + 0x589518e (0x7f938de8118e in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default4]: trainer.train(dataloader) [default5]: return forward_call(*args, **kwargs) [default0]:frame #4: c10d::TCPStore::get(std::string const&) + 0xa1 (0x7f21f88d7b11 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default5]: pipeline_state.run_communication() [default3]: return forward_call(*args, **kwargs) [default6]:frame #56: _PyEval_EvalFrameDefault + 0x2d83 (0x5619ab38a2b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:frame #15: + 0x4e893cc (0x7f2b5fa8c3cc in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default2]: return self.forward_with_hidden_states(input_ids=input_ids, input_mask=input_mask)[0] [default4]: hidden_encoder_states = encoder_block(**hidden_encoder_states) [default4]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl [default0]: outputs = self.pipeline_engine.train_batch_iter( [default2]:frame #4: c10d::TCPStore::get(std::string const&) + 0xa1 (0x7f4b447efb11 in 
/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default2]: return forward_call(*args, **kwargs) [default3]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 764, in forward [default7]: pipeline_state.run_communication() [default7]:frame #1: + 0x589518e (0x7f675a7ae18e in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default2]:frame #6: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f51b1c22f81 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default2]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 780, in forward_with_hidden_states [default4]: return self._call_impl(*args, **kwargs) [default5]:frame #2: c10d::TCPStore::doWait(c10::ArrayRef, std::chrono::duration >) + 0x360 (0x7f938de7b9a0 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default5]:frame #3: c10d::TCPStore::doGet(std::string const&) + 0x32 (0x7f938de7bce2 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default5]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 890, in forward [default3]:frame #4: c10d::TCPStore::get(std::string const&) + 0xa1 (0x7fc8de98db11 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default3]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 72, in wrapper [default2]: new_kwargs[name] = recv_from_pipeline_state_buffer( [default3]: return self.forward_with_hidden_states(input_ids=input_ids, input_mask=input_mask)[0] [default7]:frame #2: c10d::TCPStore::doWait(c10::ArrayRef, 
std::chrono::duration >) + 0x360 (0x7f675a7a89a0 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default3]:frame #16: + 0x1a08a88 (0x7f2b5c60ba88 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default4]:torch.distributed.DistBackendError: [5] is setting up NCCL communicator and retrieving ncclUniqueId from [0] via c10d key-value store by key '4:5', but store->get('4:5') got error: Connection reset by peer [default7]: return self._call_impl(*args, **kwargs) [default5]:frame #4: c10d::TCPStore::get(std::string const&) + 0xa1 (0x7f938de7cb11 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default5]:frame #5: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f938de31f81 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default5]:frame #6: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f938de31f81 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default4]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 430, in train [default5]: meta = self._recv_meta(from_rank=from_rank, tag=tag) [default6]:Exception raised from recvBytes at ../torch/csrc/distributed/c10d/Utils.hpp:670 (most recent call first): [default6]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f92b3ba3d87 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default2]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/functional.py", line 117, in recv_from_pipeline_state_buffer [default6]:frame #57: + 0x150582 (0x5619ab3a3582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:Exception 
raised from recvBytes at ../torch/csrc/distributed/c10d/Utils.hpp:670 (most recent call first): [default0]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 764, in forward [default5]:frame #7: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f938de31f81 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default5]:frame #8: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f938de31f81 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default0]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 187, in train_batch_iter [default0]:frame #5: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f21f888cf81 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default6]:frame #1: + 0x589518e (0x7f92ebb5d18e in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default7]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 150, in run_communication [default7]:frame #3: c10d::TCPStore::doGet(std::string const&) + 0x32 (0x7f675a7a8ce2 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default2]:frame #7: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f51b1c22f81 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default3]:frame #28: _PyFunction_Vectorcall + 0x6c (0x55a04ba9ca2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:frame #29: _PyEval_EvalFrameDefault + 0x13ca (0x55a04ba8d8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:frame #30: + 0x150582 
(0x55a04baa8582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:frame #9: c10d::ProcessGroupNCCL::broadcastUniqueNCCLID(ncclUniqueId*, bool, std::string const&, int) + 0xa9 (0x7f935706fc69 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default4]: outputs, loss_avg = self.training_step(dataloader=self.current_dataloader) [default5]: sharded_logits = self.model( [default2]:frame #5: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f4b447a4f81 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default6]:frame #2: c10d::TCPStore::doWait(c10::ArrayRef, std::chrono::duration >) + 0x360 (0x7f92ebb579a0 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default7]: recv_activation_tensor = recv_activation() [default6]:frame #58: _PyEval_EvalFrameDefault + 0x13ca (0x5619ab3888fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:frame #59: + 0x150582 (0x5619ab3a3582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:frame #60: PyObject_Call + 0xbc (0x5619ab3a3f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:frame #8: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f51b1c22f81 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default5]: dist.recv( [default4]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl [default4]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 459, in training_step [default5]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 246, in _recv_meta [default0]:frame #6: 
c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f21f888cf81 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default3]: return func(*args, **kwargs) [default3]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1706, in recv [default2]: pipeline_state.run_communication() [default2]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 150, in run_communication [default6]:frame #61: _PyEval_EvalFrameDefault + 0x2d83 (0x5619ab38a2b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:frame #62: + 0x150582 (0x5619ab3a3582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:frame #9: c10d::ProcessGroupNCCL::broadcastUniqueNCCLID(ncclUniqueId*, bool, std::string const&, int) + 0xa9 (0x7f517ae60c69 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default0]: return self.forward_with_hidden_states(input_ids=input_ids, input_mask=input_mask)[0] [default5]:frame #10: c10d::ProcessGroupNCCL::getNCCLComm(std::string const&, std::vector > const&, c10d::OpType, int, bool) + 0x22b (0x7f9357076c5b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default5]:frame #11: c10d::ProcessGroupNCCL::recv(std::vector >&, int, int) + 0x550 (0x7f9357099b60 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default5]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl [default2]:frame #6: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f4b447a4f81 in 
/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default1]:frame #21: + 0x1445a6 (0x555988aaa5a6 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:frame #22: _PyObject_MakeTpCall + 0x26b (0x555988aa3a6b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:Traceback (most recent call last): [default6]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 237, in [default7]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 31, in __call__ [default7]: return self.p2p.recv_tensors(num_tensors=1, from_rank=self.from_rank)[0] [default6]:frame #63: PyObject_Call + 0xbc (0x5619ab3a3f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7fc7cef93d87 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default6]: pipeline_state.run_communication() [default6]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 150, in run_communication [default4]: return forward_call(*args, **kwargs) [default0]: output = self.forward(context=context, state=state, micro_batch=micro_batch, model=model) [default0]:frame #7: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f21f888cf81 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default1]:frame #23: + 0x150866 (0x555988ab6866 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]: recv_activation_tensor = recv_activation() [default3]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 782, in forward_with_hidden_states [default7]: File 
"/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 353, in recv_tensors [default7]: buffers, futures = self.irecv_tensors(num_tensors=num_tensors, from_rank=from_rank, tag=tag) [default6]:. This may indicate a possible application crash on rank 0 or a network set up issue. [default7]:frame #4: c10d::TCPStore::get(std::string const&) + 0xa1 (0x7f675a7a9b11 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default7]:frame #1: + 0x589518e (0x7fc806f4d18e in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default3]:frame #31: _PyEval_EvalFrameDefault + 0x13ca (0x55a04ba8d8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:frame #12: + 0x5838439 (0x7f938de24439 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default4]: outputs = self.pipeline_engine.train_batch_iter( [default5]: return self._call_impl(*args, **kwargs) [default2]:frame #7: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f4b447a4f81 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default5]: dist.recv( [default3]:frame #5: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7fc8de942f81 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default0]:frame #8: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f21f888cf81 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default1]:frame #24: _PyEval_EvalFrameDefault + 0x4c12 (0x555988a9f142 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]: trainer.train(dataloader) [default7]: File 
"/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 326, in irecv_tensors [default3]: hidden_states = self.final_layer_norm(input=hidden_encoder_states["hidden_states"])["hidden_states"] [default7]:frame #5: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f675a75ef81 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default7]:frame #6: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f675a75ef81 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default0]: dist.recv( [default2]: hidden_encoder_states = encoder_block(**hidden_encoder_states) [default4]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/block.py", line 126, in forward [default0]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 44, in forward [default5]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 72, in wrapper [default3]: pg.recv([tensor], group_src_rank, tag).wait() [default6]:frame #3: c10d::TCPStore::doGet(std::string const&) + 0x32 (0x7f92ebb57ce2 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default3]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl [default2]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 31, in __call__ [default7]:frame #7: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f675a75ef81 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default7]:frame #8: c10d::PrefixStore::get(std::string 
const&) + 0x31 (0x7f675a75ef81 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default2]:frame #10: c10d::ProcessGroupNCCL::getNCCLComm(std::string const&, std::vector > const&, c10d::OpType, int, bool) + 0x22b (0x7f517ae67c5b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default7]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl [default5]:frame #13: + 0x5843330 (0x7f938de2f330 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default5]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl [default3]:frame #6: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7fc8de942f81 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default3]:frame #7: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7fc8de942f81 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default2]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/block.py", line 126, in forward [default5]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 150, in run_communication [default6]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 430, in train [default7]:frame #9: c10d::ProcessGroupNCCL::broadcastUniqueNCCLID(ncclUniqueId*, bool, std::string const&, int) + 0xa9 (0x7f672399cc69 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default7]:frame #10: c10d::ProcessGroupNCCL::getNCCLComm(std::string 
const&, std::vector > const&, c10d::OpType, int, bool) + 0x22b (0x7f67239a3c5b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default4]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f7863260d87 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default1]:frame #33: _PyEval_EvalFrameDefault + 0x13ca (0x5577f18d18fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]: new_kwargs[name] = recv_from_pipeline_state_buffer( [default0]: output = model(**micro_batch) [default0]:frame #9: c10d::ProcessGroupNCCL::broadcastUniqueNCCLID(ncclUniqueId*, bool, std::string const&, int) + 0xa9 (0x7f21c1acac69 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default0]:frame #1: + 0x589518e (0x7f91bff4a18e in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default2]: return self.p2p.recv_tensors(num_tensors=1, from_rank=self.from_rank)[0] [default2]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 353, in recv_tensors [default7]:frame #11: c10d::ProcessGroupNCCL::recv(std::vector >&, int, int) + 0x550 (0x7f67239c6b60 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default7]:frame #12: + 0x5838439 (0x7f675a751439 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default7]:frame #13: + 0x5843330 (0x7f675a75c330 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default7]:frame #14: + 0x58433c5 (0x7f675a75c3c5 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default3]:frame #17: 
+ 0x5849a84 (0x7f2b6044ca84 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default0]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 72, in wrapper [default5]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 72, in wrapper [default5]:frame #14: + 0x58433c5 (0x7f938de2f3c5 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default5]:frame #15: + 0x4e893cc (0x7f938d4753cc in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default4]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 187, in train_batch_iter [default2]:frame #8: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f4b447a4f81 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default6]:frame #4: c10d::TCPStore::get(std::string const&) + 0xa1 (0x7f92ebb58b11 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default6]:frame #5: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f92ebb0df81 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default7]: meta = self._recv_meta(from_rank=from_rank, tag=tag) [default7]:frame #15: + 0x4e893cc (0x7f6759da23cc in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default7]:frame #16: + 0x1a08a88 (0x7f6756921a88 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default7]:frame #17: + 0x5849a84 (0x7f675a762a84 in 
/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default2]:frame #11: c10d::ProcessGroupNCCL::recv(std::vector >&, int, int) + 0x550 (0x7f517ae8ab60 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default0]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 780, in forward_with_hidden_states [default4]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/functional.py", line 117, in recv_from_pipeline_state_buffer [default4]: pipeline_state.run_communication() [default5]: return forward_call(*args, **kwargs) [default5]: return func(*args, **kwargs) [default3]:frame #8: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7fc8de942f81 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default2]:frame #9: c10d::ProcessGroupNCCL::broadcastUniqueNCCLID(ncclUniqueId*, bool, std::string const&, int) + 0xa9 (0x7f4b0d9e2c69 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default7]:Traceback (most recent call last): [default3]: return self._call_impl(*args, **kwargs) [default7]:frame #18: + 0x584ed35 (0x7f675a767d35 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default7]:frame #19: + 0xc97eee (0x7f676d019eee in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_python.so) [default7]:frame #20: + 0x413ea4 (0x7f676c795ea4 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_python.so) [default4]:frame #1: + 0x589518e (0x7f789b21a18e in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default2]: File 
"/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl [default5]:frame #16: + 0x1a08a88 (0x7f9389ff4a88 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default0]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl [default5]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1706, in recv [default2]: new_kwargs[name] = recv_from_pipeline_state_buffer( [default0]:frame #2: c10d::TCPStore::doWait(c10::ArrayRef, std::chrono::duration >) + 0x360 (0x7f91bff449a0 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default1]:frame #25: _PyFunction_Vectorcall + 0x6c (0x555988aaaa2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]: buffers, futures = self.irecv_tensors(num_tensors=num_tensors, from_rank=from_rank, tag=tag) [default7]:frame #21: + 0x1445a6 (0x564eeef205a6 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:frame #22: _PyObject_MakeTpCall + 0x26b (0x564eeef19a6b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:frame #23: + 0x150866 (0x564eeef2c866 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]: return func(*args, **kwargs) [default2]: return self._call_impl(*args, **kwargs) [default7]: return forward_call(*args, **kwargs) [default3]:frame #32: + 0x150582 (0x55a04baa8582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:frame #17: + 0x5849a84 (0x7f938de35a84 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default4]: output = 
self.forward(context=context, state=state, micro_batch=micro_batch, model=model) [default0]:frame #10: c10d::ProcessGroupNCCL::getNCCLComm(std::string const&, std::vector > const&, c10d::OpType, int, bool) + 0x22b (0x7f21c1ad1c5b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default1]:frame #26: PyObject_Call + 0xbc (0x555988ab6f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 326, in irecv_tensors [default6]: outputs, loss_avg = self.training_step(dataloader=self.current_dataloader) [default7]:frame #24: _PyEval_EvalFrameDefault + 0x4c12 (0x564eeef15142 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:frame #25: _PyFunction_Vectorcall + 0x6c (0x564eeef20a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:frame #26: PyObject_Call + 0xbc (0x564eeef2cf1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:frame #27: _PyEval_EvalFrameDefault + 0x2d83 (0x564eeef132b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:frame #28: _PyFunction_Vectorcall + 0x6c (0x564eeef20a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1706, in recv [default4]: hidden_encoder_states = encoder_block(**hidden_encoder_states) [default4]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl [default4]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 150, in run_communication [default5]: File 
"/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 764, in forward [default3]:frame #9: c10d::ProcessGroupNCCL::broadcastUniqueNCCLID(ncclUniqueId*, bool, std::string const&, int) + 0xa9 (0x7fc8a7b80c69 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default3]:torch.distributed.DistBackendError: [6] is setting up NCCL communicator and retrieving ncclUniqueId from [0] via c10d key-value store by key '5:6', but store->get('5:6') got error: Connection reset by peer [default3]:Exception raised from recvBytes at ../torch/csrc/distributed/c10d/Utils.hpp:670 (most recent call first): [default2]: meta = self._recv_meta(from_rank=from_rank, tag=tag) [default2]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 246, in _recv_meta [default7]:frame #29: _PyEval_EvalFrameDefault + 0x13ca (0x564eeef118fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:frame #30: + 0x150582 (0x564eeef2c582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:frame #31: _PyEval_EvalFrameDefault + 0x13ca (0x564eeef118fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:frame #32: + 0x150582 (0x564eeef2c582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:frame #12: + 0x5838439 (0x7f51b1c15439 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default2]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl [default4]: recv_activation_tensor = recv_activation() [default0]: return self._call_impl(*args, **kwargs) [default2]:frame #10: c10d::ProcessGroupNCCL::getNCCLComm(std::string const&, std::vector > const&, c10d::OpType, int, bool) + 0x22b (0x7f4b0d9e9c5b in 
/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default2]:frame #11: c10d::ProcessGroupNCCL::recv(std::vector >&, int, int) + 0x550 (0x7f4b0da0cb60 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default6]:frame #6: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f92ebb0df81 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default6]:frame #7: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f92ebb0df81 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default6]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 459, in training_step [default7]:frame #33: _PyEval_EvalFrameDefault + 0x13ca (0x564eeef118fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:frame #34: + 0x150582 (0x564eeef2c582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:frame #35: _PyEval_EvalFrameDefault + 0x13ca (0x564eeef118fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:frame #36: _PyObject_FastCallDictTstate + 0xd0 (0x564eeef18f50 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:frame #37: _PyObject_Call_Prepend + 0x69 (0x564eeef2ac39 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:Traceback (most recent call last): [default6]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 237, in [default4]:frame #2: c10d::TCPStore::doWait(c10::ArrayRef, std::chrono::duration >) + 0x360 (0x7f789b2149a0 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default4]:frame #3: c10d::TCPStore::doGet(std::string const&) + 0x32 (0x7f789b214ce2 in 
/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default7]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 764, in forward [default5]:frame #18: + 0x584ed35 (0x7f938de3ad35 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default5]: return self.forward_with_hidden_states(input_ids=input_ids, input_mask=input_mask)[0] [default0]:frame #11: c10d::ProcessGroupNCCL::recv(std::vector >&, int, int) + 0x550 (0x7f21c1af4b60 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default7]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 237, in [default5]: recv_activation_tensor = recv_activation() [default3]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl [default7]:frame #38: + 0x211239 (0x564eeefed239 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:frame #39: _PyObject_MakeTpCall + 0x26b (0x564eeef19a6b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:frame #40: _PyEval_EvalFrameDefault + 0x4eb6 (0x564eeef153e6 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:frame #41: _PyFunction_Vectorcall + 0x6c (0x564eeef20a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:frame #18: + 0x584ed35 (0x7f2b60451d35 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default5]:Traceback (most recent call last): [default5]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 237, in [default1]:frame #34: + 0x150582 (0x5577f18ec582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:frame 
#19: + 0xc97eee (0x7f93a06eceee in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_python.so) [default0]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl [default5]: pg.recv([tensor], group_src_rank, tag).wait() [default2]:frame #12: + 0x5838439 (0x7f4b44797439 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default4]:frame #18: + 0x584ed35 (0x7f4d8c4f0d35 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default3]: return forward_call(*args, **kwargs) [default7]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 246, in _recv_meta [default7]:frame #42: _PyEval_EvalFrameDefault + 0x72c (0x564eeef10c5c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:frame #43: _PyFunction_Vectorcall + 0x6c (0x564eeef20a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:frame #44: _PyEval_EvalFrameDefault + 0x13ca (0x564eeef118fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:frame #45: + 0x150582 (0x564eeef2c582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:frame #46: PyObject_Call + 0xbc (0x564eeef2cf1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:frame #2: c10d::TCPStore::doWait(c10::ArrayRef, std::chrono::duration >) + 0x360 (0x7fc806f479a0 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default1]:frame #35: _PyEval_EvalFrameDefault + 0x13ca (0x5577f18d18fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:frame #20: + 0x413ea4 (0x7f939fe68ea4 in 
/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_python.so) [default0]: return forward_call(*args, **kwargs) [default5]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 780, in forward_with_hidden_states [default4]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 44, in forward [default0]:frame #12: + 0x5838439 (0x7f21f887f439 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default0]:frame #13: + 0x5843330 (0x7f21f888a330 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default4]:frame #19: + 0xc97eee (0x7f4d9eda2eee in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_python.so) [default2]: dist.recv( [default2]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 72, in wrapper [default7]:frame #47: _PyEval_EvalFrameDefault + 0x2d83 (0x564eeef132b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:frame #48: + 0x150582 (0x564eeef2c582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:frame #49: PyObject_Call + 0xbc (0x564eeef2cf1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:frame #50: _PyEval_EvalFrameDefault + 0x2d83 (0x564eeef132b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:frame #51: _PyFunction_Vectorcall + 0x6c (0x564eeef20a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]: pg.recv([tensor], group_src_rank, tag).wait() [default0]:torch.distributed.DistBackendError: [4] is setting up NCCL communicator and retrieving ncclUniqueId from [0] via c10d key-value store by key 
'3:4', but store->get('3:4') got error: Connection reset by peer [default3]:frame #33: _PyEval_EvalFrameDefault + 0x13ca (0x55a04ba8d8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:frame #21: + 0x1445a6 (0x55bb40d975a6 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:frame #22: _PyObject_MakeTpCall + 0x26b (0x55bb40d90a6b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]: output = model(**micro_batch) [default5]: hidden_encoder_states = encoder_block(**hidden_encoder_states) [default0]:frame #14: + 0x58433c5 (0x7f21f888a3c5 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default2]:frame #13: + 0x5843330 (0x7f4b447a2330 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default3]:frame #10: c10d::ProcessGroupNCCL::getNCCLComm(std::string const&, std::vector > const&, c10d::OpType, int, bool) + 0x22b (0x7fc8a7b87c5b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default2]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/functional.py", line 117, in recv_from_pipeline_state_buffer [default2]: return func(*args, **kwargs) [default7]:frame #52: _PyObject_FastCallDictTstate + 0x187 (0x564eeef19007 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:frame #53: _PyObject_Call_Prepend + 0x69 (0x564eeef2ac39 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:frame #54: + 0x211239 (0x564eeefed239 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:frame #55: PyObject_Call + 0x207 (0x564eeef2d067 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:frame #56: _PyEval_EvalFrameDefault + 0x2d83 (0x564eeef132b3 in 
/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:frame #4: c10d::TCPStore::get(std::string const&) + 0xa1 (0x7f789b215b11 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default6]: recv_activation_tensor = recv_activation() [default4]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 31, in __call__ [default0]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 890, in forward [default5]:torch.distributed.DistBackendError: [11] is setting up NCCL communicator and retrieving ncclUniqueId from [0] via c10d key-value store by key '10:11', but store->get('10:11') got error: Connection reset by peer [default3]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f2af68f9d87 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default5]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 31, in __call__ [default1]:frame #27: _PyEval_EvalFrameDefault + 0x2d83 (0x555988a9d2b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/block.py", line 126, in forward [default7]:frame #57: + 0x150582 (0x564eeef2c582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:frame #58: _PyEval_EvalFrameDefault + 0x13ca (0x564eeef118fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:frame #59: + 0x150582 (0x564eeef2c582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:frame #60: PyObject_Call + 0xbc (0x564eeef2cf1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:frame #5: 
c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f789b1caf81 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default7]:frame #3: c10d::TCPStore::doGet(std::string const&) + 0x32 (0x7fc806f47ce2 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default5]: return func(*args, **kwargs) [default4]: return self.p2p.recv_tensors(num_tensors=1, from_rank=self.from_rank)[0] [default5]:frame #23: + 0x150866 (0x55bb40da3866 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl [default2]:frame #14: + 0x58433c5 (0x7f4b447a23c5 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default7]: trainer.train(dataloader) [default2]: pipeline_state.run_communication() [default3]: new_kwargs[name] = recv_from_pipeline_state_buffer( [default6]: outputs = self.pipeline_engine.train_batch_iter( [default7]:frame #61: _PyEval_EvalFrameDefault + 0x2d83 (0x564eeef132b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:frame #62: + 0x150582 (0x564eeef2c582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:frame #63: PyObject_Call + 0xbc (0x564eeef2cf1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]: trainer.train(dataloader) [default0]: hidden_encoder_states = encoder_block(**hidden_encoder_states) [default0]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl [default4]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 353, in recv_tensors [default4]: return 
self._call_impl(*args, **kwargs) [default2]:frame #15: + 0x4e893cc (0x7f4b43de83cc in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default5]:Exception raised from recvBytes at ../torch/csrc/distributed/c10d/Utils.hpp:670 (most recent call first): [default6]:frame #8: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f92ebb0df81 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default5]: return self.p2p.recv_tensors(num_tensors=1, from_rank=self.from_rank)[0] [default2]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1706, in recv [default7]:. This may indicate a possible application crash on rank 0 or a network set up issue. [default5]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 430, in train [default3]:frame #19: + 0xc97eee (0x7f2b72d03eee in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_python.so) [default3]:frame #20: + 0x413ea4 (0x7f2b7247fea4 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_python.so) [default0]:Exception raised from recvBytes at ../torch/csrc/distributed/c10d/Utils.hpp:670 (most recent call first): [default3]:frame #34: + 0x150582 (0x55a04baa8582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:frame #24: _PyEval_EvalFrameDefault + 0x4c12 (0x55bb40d8c142 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]: sharded_logits = self.model( [default3]:frame #11: c10d::ProcessGroupNCCL::recv(std::vector >&, int, int) + 0x550 (0x7fc8a7baab60 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default1]:frame #28: _PyFunction_Vectorcall + 0x6c (0x555988aaaa2c in 
/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:frame #29: _PyEval_EvalFrameDefault + 0x13ca (0x555988a9b8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]: pg.recv([tensor], group_src_rank, tag).wait() [default6]: trainer.train(dataloader) [default5]: outputs, loss_avg = self.training_step(dataloader=self.current_dataloader) [default6]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 31, in __call__ [default5]:frame #25: _PyFunction_Vectorcall + 0x6c (0x55bb40d97a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl [default5]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7feffe7aad87 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default2]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 150, in run_communication [default6]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 187, in train_batch_iter [default7]: dist.recv( [default0]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7fbfbaafdd87 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default0]:frame #1: + 0x589518e (0x7fbff2ab718e in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default2]: return forward_call(*args, **kwargs) [default5]:frame #26: PyObject_Call + 0xbc (0x55bb40da3f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]: File 
"/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl [default3]:frame #12: + 0x5838439 (0x7fc8de935439 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default0]:frame #3: c10d::TCPStore::doGet(std::string const&) + 0x32 (0x7f91bff44ce2 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default2]:torch.distributed.DistBackendError: [14] is setting up NCCL communicator and retrieving ncclUniqueId from [0] via c10d key-value store by key '13:14', but store->get('13:14') got error: Connection reset by peer [default6]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 430, in train [default2]:frame #13: + 0x5843330 (0x7f51b1c20330 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default2]:frame #14: + 0x58433c5 (0x7f51b1c203c5 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default0]: return self._call_impl(*args, **kwargs) [default5]:frame #27: _PyEval_EvalFrameDefault + 0x2d83 (0x55bb40d8a2b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:frame #28: _PyFunction_Vectorcall + 0x6c (0x55bb40d97a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl [default2]:frame #16: + 0x1a08a88 (0x7f4b40967a88 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default7]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 430, in train [default6]:frame #9: 
c10d::ProcessGroupNCCL::broadcastUniqueNCCLID(ncclUniqueId*, bool, std::string const&, int) + 0xa9 (0x7f92b4d4bc69 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default1]:frame #30: + 0x150582 (0x555988ab6582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 72, in wrapper [default7]: return func(*args, **kwargs) [default7]:frame #4: c10d::TCPStore::get(std::string const&) + 0xa1 (0x7fc806f48b11 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default4]: return self._call_impl(*args, **kwargs) [default5]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1706, in recv [default5]:frame #29: _PyEval_EvalFrameDefault + 0x13ca (0x55bb40d888fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:frame #30: + 0x150582 (0x55bb40da3582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]: return self._call_impl(*args, **kwargs) [default0]:frame #15: + 0x4e893cc (0x7f21f7ed03cc in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default5]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 353, in recv_tensors [default5]: buffers, futures = self.irecv_tensors(num_tensors=num_tensors, from_rank=from_rank, tag=tag) [default7]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1706, in recv [default6]: output = self.forward(context=context, state=state, micro_batch=micro_batch, model=model) [default7]:frame #5: c10d::PrefixStore::get(std::string const&) 
+ 0x31 (0x7fc806efdf81 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default6]: return self.p2p.recv_tensors(num_tensors=1, from_rank=self.from_rank)[0] [default6]:Traceback (most recent call last): [default3]:Traceback (most recent call last): [default4]: return forward_call(*args, **kwargs) [default5]: return self._call_impl(*args, **kwargs) [default0]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl [default3]:frame #13: + 0x5843330 (0x7fc8de940330 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default4]:frame #20: + 0x413ea4 (0x7f4d9e51eea4 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_python.so) [default7]: pg.recv([tensor], group_src_rank, tag).wait() [default2]:Exception raised from recvBytes at ../torch/csrc/distributed/c10d/Utils.hpp:670 (most recent call first): [default2]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f7c411f5d87 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default0]:frame #2: c10d::TCPStore::doWait(c10::ArrayRef, std::chrono::duration >) + 0x360 (0x7fbff2ab19a0 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default6]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 353, in recv_tensors [default7]: return self.forward_with_hidden_states(input_ids=input_ids, input_mask=input_mask)[0] [default5]:frame #31: _PyEval_EvalFrameDefault + 0x13ca (0x55bb40d888fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:frame #32: + 0x150582 (0x55bb40da3582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) 
[default5]:frame #33: _PyEval_EvalFrameDefault + 0x13ca (0x55bb40d888fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 890, in forward [default2]:frame #17: + 0x5849a84 (0x7f4b447a8a84 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default2]: recv_activation_tensor = recv_activation() [default6]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 44, in forward [default6]: output = model(**micro_batch) [default0]:frame #3: c10d::TCPStore::doGet(std::string const&) + 0x32 (0x7fbff2ab1ce2 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default3]:frame #21: + 0x1445a6 (0x56108716f5a6 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl [default0]:Traceback (most recent call last): [default3]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 237, in [default5]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl [default0]:frame #16: + 0x1a08a88 (0x7f21f4a4fa88 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default1]:frame #31: _PyEval_EvalFrameDefault + 0x13ca (0x555988a9b8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:torch.distributed.DistBackendError: [15] is setting up NCCL communicator and retrieving ncclUniqueId from [0] via c10d key-value store by key '14:15', but store->get('14:15') got error: Connection reset by peer [default7]:Exception raised from 
recvBytes at ../torch/csrc/distributed/c10d/Utils.hpp:670 (most recent call first): [default3]:frame #22: _PyObject_MakeTpCall + 0x26b (0x561087168a6b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:frame #6: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f789b1caf81 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default1]:frame #36: _PyObject_FastCallDictTstate + 0xd0 (0x5577f18d8f50 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 237, in [default0]: return forward_call(*args, **kwargs) [default5]:frame #1: + 0x589518e (0x7ff03676418e in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default5]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 326, in irecv_tensors [default6]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl [default6]: return self._call_impl(*args, **kwargs) [default5]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 459, in training_step [default1]:frame #37: _PyObject_Call_Prepend + 0x69 (0x5577f18eac39 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl [default7]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 780, in forward_with_hidden_states [default3]: trainer.train(dataloader) [default5]: return forward_call(*args, **kwargs) [default4]: sharded_logits = self.model( [default3]:frame #14: + 0x58433c5 (0x7fc8de9403c5 in 
/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default3]:frame #1: + 0x589518e (0x7f2b2e8b318e in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default0]:frame #4: c10d::TCPStore::get(std::string const&) + 0xa1 (0x7f91bff45b11 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default0]:frame #5: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f91bfefaf81 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default2]:frame #1: + 0x589518e (0x7f7c791af18e in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default5]: outputs = self.pipeline_engine.train_batch_iter( [default7]:frame #6: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7fc806efdf81 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default5]: pg.recv([tensor], group_src_rank, tag).wait() [default5]:frame #34: + 0x150582 (0x55bb40da3582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl [default2]:frame #18: + 0x584ed35 (0x7f4b447add35 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default0]:frame #17: + 0x5849a84 (0x7f21f8890a84 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default6]:frame #10: c10d::ProcessGroupNCCL::getNCCLComm(std::string const&, std::vector > const&, c10d::OpType, int, bool) + 0x22b (0x7f92b4d52c5b in 
/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default4]:frame #21: + 0x1445a6 (0x55bf556015a6 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f8e855bbd87 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default7]:frame #7: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7fc806efdf81 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default4]: return forward_call(*args, **kwargs) [default6]: trainer.train(dataloader) [default5]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/block.py", line 126, in forward [default3]:frame #15: + 0x4e893cc (0x7fc8ddf863cc in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default2]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 31, in __call__ [default6]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl [default6]: return forward_call(*args, **kwargs) [default4]:frame #7: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f789b1caf81 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default4]:frame #8: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f789b1caf81 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default3]:frame #23: + 0x150866 (0x56108717b866 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:frame #38: + 0x211239 (0x5577f19ad239 in 
/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:frame #39: _PyObject_MakeTpCall + 0x26b (0x5577f18d9a6b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 237, in [default0]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 764, in forward [default5]:frame #2: c10d::TCPStore::doWait(c10::ArrayRef, std::chrono::duration >) + 0x360 (0x7ff03675e9a0 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default5]:frame #3: c10d::TCPStore::doGet(std::string const&) + 0x32 (0x7ff03675ece2 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default2]: return self.p2p.recv_tensors(num_tensors=1, from_rank=self.from_rank)[0] [default7]: outputs, loss_avg = self.training_step(dataloader=self.current_dataloader) [default5]: meta = self._recv_meta(from_rank=from_rank, tag=tag) [default7]:frame #1: + 0x589518e (0x7f8ebd57518e in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default7]:frame #8: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7fc806efdf81 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default7]: hidden_encoder_states = encoder_block(**hidden_encoder_states) [default0]: trainer.train(dataloader) [default4]: buffers, futures = self.irecv_tensors(num_tensors=num_tensors, from_rank=from_rank, tag=tag) [default4]: return self._call_impl(*args, **kwargs) [default2]:frame #19: + 0xc97eee (0x7f4b5705feee in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_python.so) [default6]:frame #11: c10d::ProcessGroupNCCL::recv(std::vector >&, int, int) + 0x550 (0x7f92b4d75b60 in 
/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default3]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/functional.py", line 117, in recv_from_pipeline_state_buffer [default7]:frame #9: c10d::ProcessGroupNCCL::broadcastUniqueNCCLID(ncclUniqueId*, bool, std::string const&, int) + 0xa9 (0x7fc7d013bc69 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default4]:frame #9: c10d::ProcessGroupNCCL::broadcastUniqueNCCLID(ncclUniqueId*, bool, std::string const&, int) + 0xa9 (0x7f7864408c69 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default4]:frame #10: c10d::ProcessGroupNCCL::getNCCLComm(std::string const&, std::vector > const&, c10d::OpType, int, bool) + 0x22b (0x7f786440fc5b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default0]: return forward_call(*args, **kwargs) [default6]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 430, in train [default0]: return self.forward_with_hidden_states(input_ids=input_ids, input_mask=input_mask)[0] [default5]: new_kwargs[name] = recv_from_pipeline_state_buffer( [default5]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/functional.py", line 117, in recv_from_pipeline_state_buffer [default0]:frame #18: + 0x584ed35 (0x7f21f8895d35 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default1]:frame #32: + 0x150582 (0x555988ab6582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 890, in forward [default3]:frame #24: _PyEval_EvalFrameDefault + 
0x4c12 (0x561087164142 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/block.py", line 126, in forward [default0]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 430, in train [default5]:frame #35: _PyEval_EvalFrameDefault + 0x13ca (0x55bb40d888fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl [default5]: pipeline_state.run_communication() [default5]:frame #4: c10d::TCPStore::get(std::string const&) + 0xa1 (0x7ff03675fb11 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default2]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 353, in recv_tensors [default6]: sharded_logits = self.model( [default3]:frame #25: _PyFunction_Vectorcall + 0x6c (0x56108716fa2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]: new_kwargs[name] = recv_from_pipeline_state_buffer( [default5]:torch.distributed.DistBackendError: [13] is setting up NCCL communicator and retrieving ncclUniqueId from [0] via c10d key-value store by key '12:13', but store->get('12:13') got error: Connection reset by peer [default7]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl [default3]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 430, in train [default4]: return forward_call(*args, **kwargs) [default2]:frame #20: + 0x413ea4 (0x7f4b567dbea4 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_python.so) 
[default3]:frame #2: c10d::TCPStore::doWait(c10::ArrayRef, std::chrono::duration >) + 0x360 (0x7f2b2e8ad9a0 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default0]:frame #6: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f91bfefaf81 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default3]: pipeline_state.run_communication() [default7]:frame #10: c10d::ProcessGroupNCCL::getNCCLComm(std::string const&, std::vector > const&, c10d::OpType, int, bool) + 0x22b (0x7fc7d0142c5b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default3]:frame #35: _PyEval_EvalFrameDefault + 0x13ca (0x55a04ba8d8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 326, in irecv_tensors [default5]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 150, in run_communication [default0]:frame #19: + 0xc97eee (0x7f220b147eee in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_python.so) [default0]:frame #20: + 0x413ea4 (0x7f220a8c3ea4 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_python.so) [default1]:frame #33: _PyEval_EvalFrameDefault + 0x13ca (0x555988a9b8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 150, in run_communication [default2]:frame #2: c10d::TCPStore::doWait(c10::ArrayRef, std::chrono::duration >) + 0x360 (0x7f7c791a99a0 in 
/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default5]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 187, in train_batch_iter [default2]:frame #15: + 0x4e893cc (0x7f51b12663cc in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default0]:frame #4: c10d::TCPStore::get(std::string const&) + 0xa1 (0x7fbff2ab2b11 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default6]: buffers, futures = self.irecv_tensors(num_tensors=num_tensors, from_rank=from_rank, tag=tag) [default2]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/functional.py", line 117, in recv_from_pipeline_state_buffer [default5]:frame #36: _PyObject_FastCallDictTstate + 0xd0 (0x55bb40d8ff50 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 764, in forward [default5]:frame #5: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7ff036714f81 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default5]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 246, in _recv_meta [default7]:frame #2: c10d::TCPStore::doWait(c10::ArrayRef, std::chrono::duration >) + 0x360 (0x7f8ebd56f9a0 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default4]:frame #11: c10d::ProcessGroupNCCL::recv(std::vector >&, int, int) + 0x550 (0x7f7864432b60 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default5]:Exception raised from recvBytes at 
../torch/csrc/distributed/c10d/Utils.hpp:670 (most recent call first): [default3]:frame #36: _PyObject_FastCallDictTstate + 0xd0 (0x55a04ba94f50 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]: outputs, loss_avg = self.training_step(dataloader=self.current_dataloader) [default5]: recv_activation_tensor = recv_activation() [default2]:frame #21: + 0x1445a6 (0x564de4a5f5a6 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:frame #3: c10d::TCPStore::doGet(std::string const&) + 0x32 (0x7f2b2e8adce2 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default0]:frame #7: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f91bfefaf81 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default0]:frame #8: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f91bfefaf81 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default7]:frame #3: c10d::TCPStore::doGet(std::string const&) + 0x32 (0x7f8ebd56fce2 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default6]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl [default6]: return self._call_impl(*args, **kwargs) [default7]:frame #11: c10d::ProcessGroupNCCL::recv(std::vector >&, int, int) + 0x550 (0x7fc7d0165b60 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default3]:frame #26: PyObject_Call + 0xbc (0x56108717bf1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 326, in irecv_tensors [default0]: File 
"/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/block.py", line 126, in forward [default2]: pipeline_state.run_communication() [default6]: outputs, loss_avg = self.training_step(dataloader=self.current_dataloader) [default0]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 780, in forward_with_hidden_states [default0]:frame #21: + 0x1445a6 (0x563e238395a6 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:frame #16: + 0x1a08a88 (0x7fc8dab05a88 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default4]:frame #22: _PyObject_MakeTpCall + 0x26b (0x55bf555faa6b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:frame #3: c10d::TCPStore::doGet(std::string const&) + 0x32 (0x7f7c791a9ce2 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default6]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl [default0]:frame #5: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7fbff2a67f81 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default5]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f5331722d87 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default3]:frame #37: _PyObject_Call_Prepend + 0x69 (0x55a04baa6c39 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/block.py", line 126, in forward [default6]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 459, in training_step [default4]: meta = 
self._recv_meta(from_rank=from_rank, tag=tag) [default5]:frame #37: _PyObject_Call_Prepend + 0x69 (0x55bb40da1c39 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:frame #38: + 0x211239 (0x55bb40e64239 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 31, in __call__ [default2]:frame #22: _PyObject_MakeTpCall + 0x26b (0x564de4a58a6b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:frame #23: + 0x150866 (0x55bf5560d866 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]: recv_activation_tensor = recv_activation() [default2]:frame #4: c10d::TCPStore::get(std::string const&) + 0xa1 (0x7f7c791aab11 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default5]: output = self.forward(context=context, state=state, micro_batch=micro_batch, model=model) [default6]: meta = self._recv_meta(from_rank=from_rank, tag=tag) [default6]: outputs = self.pipeline_engine.train_batch_iter( [default3]: outputs, loss_avg = self.training_step(dataloader=self.current_dataloader) [default3]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 459, in training_step [default4]: return self.forward_with_hidden_states(input_ids=input_ids, input_mask=input_mask)[0] [default0]: hidden_encoder_states = encoder_block(**hidden_encoder_states) [default5]:frame #6: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7ff036714f81 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default7]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 459, in training_step [default2]: buffers, futures = self.irecv_tensors(num_tensors=num_tensors, from_rank=from_rank, tag=tag) 
[default0]:frame #9: c10d::ProcessGroupNCCL::broadcastUniqueNCCLID(ncclUniqueId*, bool, std::string const&, int) + 0xa9 (0x7f9189138c69 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default6]: return forward_call(*args, **kwargs) [default2]:frame #16: + 0x1a08a88 (0x7f51adde5a88 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default3]:frame #27: _PyEval_EvalFrameDefault + 0x2d83 (0x5610871622b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:frame #28: _PyFunction_Vectorcall + 0x6c (0x56108716fa2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 246, in _recv_meta [default2]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 150, in run_communication [default5]:frame #1: + 0x589518e (0x7f53696dc18e in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default5]:frame #39: _PyObject_MakeTpCall + 0x26b (0x55bb40d90a6b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl [default0]:frame #22: _PyObject_MakeTpCall + 0x26b (0x563e23832a6b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:frame #10: c10d::ProcessGroupNCCL::getNCCLComm(std::string const&, std::vector > const&, c10d::OpType, int, bool) + 0x22b (0x7f918913fc5b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default6]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", 
line 764, in forward [default7]:frame #12: + 0x5838439 (0x7fc806ef0439 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default7]:frame #13: + 0x5843330 (0x7fc806efb330 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default5]:frame #2: c10d::TCPStore::doWait(c10::ArrayRef, std::chrono::duration >) + 0x360 (0x7f53696d69a0 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default4]: new_kwargs[name] = recv_from_pipeline_state_buffer( [default5]:frame #40: _PyEval_EvalFrameDefault + 0x4eb6 (0x55bb40d8c3e6 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 780, in forward_with_hidden_states [default5]:frame #7: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7ff036714f81 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default2]:frame #23: + 0x150866 (0x564de4a6b866 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:frame #4: c10d::TCPStore::get(std::string const&) + 0xa1 (0x7f2b2e8aeb11 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default6]: return self.forward_with_hidden_states(input_ids=input_ids, input_mask=input_mask)[0] [default3]:frame #29: _PyEval_EvalFrameDefault + 0x13ca (0x5610871608fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:frame #30: + 0x150582 (0x56108717b582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:frame #38: + 0x211239 (0x55a04bb69239 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]: outputs = self.pipeline_engine.train_batch_iter( [default0]: return 
self._call_impl(*args, **kwargs) [default5]: return self.p2p.recv_tensors(num_tensors=1, from_rank=self.from_rank)[0] [default0]:frame #23: + 0x150866 (0x563e23845866 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:frame #34: + 0x150582 (0x555988ab6582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:frame #4: c10d::TCPStore::get(std::string const&) + 0xa1 (0x7f8ebd570b11 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default3]:frame #31: _PyEval_EvalFrameDefault + 0x13ca (0x5610871608fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]: return self._call_impl(*args, **kwargs) [default0]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 459, in training_step [default0]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl [default5]:frame #8: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7ff036714f81 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default5]: dist.recv( [default7]:frame #5: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f8ebd525f81 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default6]: outputs, loss_avg = self.training_step(dataloader=self.current_dataloader) [default4]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/functional.py", line 117, in recv_from_pipeline_state_buffer [default6]: dist.recv( [default0]: new_kwargs[name] = recv_from_pipeline_state_buffer( [default1]:frame #40: _PyEval_EvalFrameDefault + 0x4eb6 (0x5577f18d53e6 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]: outputs = self.pipeline_engine.train_batch_iter( 
[default5]:frame #41: _PyFunction_Vectorcall + 0x6c (0x55bb40d97a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 353, in recv_tensors [default5]: buffers, futures = self.irecv_tensors(num_tensors=num_tensors, from_rank=from_rank, tag=tag) [default0]:frame #24: _PyEval_EvalFrameDefault + 0x4c12 (0x563e2382e142 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:frame #17: + 0x5849a84 (0x7fc8de946a84 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default0]:frame #11: c10d::ProcessGroupNCCL::recv(std::vector >&, int, int) + 0x550 (0x7f9189162b60 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default3]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 31, in __call__ [default2]:frame #17: + 0x5849a84 (0x7f51b1c26a84 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default2]: recv_activation_tensor = recv_activation() [default5]:frame #42: _PyEval_EvalFrameDefault + 0x72c (0x55bb40d87c5c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 187, in train_batch_iter [default0]: return forward_call(*args, **kwargs) [default5]:frame #9: c10d::ProcessGroupNCCL::broadcastUniqueNCCLID(ncclUniqueId*, bool, std::string const&, int) + 0xa9 (0x7fefff952c69 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default3]:frame #5: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f2b2e863f81 in 
/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default7]:frame #6: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f8ebd525f81 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default3]:frame #32: + 0x150582 (0x56108717b582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:frame #12: + 0x5838439 (0x7f789b1bd439 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default3]:frame #39: _PyObject_MakeTpCall + 0x26b (0x55a04ba95a6b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 187, in train_batch_iter [default5]:frame #43: _PyFunction_Vectorcall + 0x6c (0x55bb40d97a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/block.py", line 126, in forward [default5]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 326, in irecv_tensors [default5]: meta = self._recv_meta(from_rank=from_rank, tag=tag) [default5]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 246, in _recv_meta [default5]:frame #10: c10d::ProcessGroupNCCL::getNCCLComm(std::string const&, std::vector > const&, c10d::OpType, int, bool) + 0x22b (0x7fefff959c5b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default6]:frame #12: + 0x5838439 (0x7f92ebb00439 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default6]:frame #13: + 0x5843330 (0x7f92ebb0b330 in 
/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default6]:frame #14: + 0x58433c5 (0x7f92ebb0b3c5 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default2]:frame #5: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f7c7915ff81 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default4]:frame #13: + 0x5843330 (0x7f789b1c8330 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default3]:frame #40: _PyEval_EvalFrameDefault + 0x4eb6 (0x55a04ba913e6 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:frame #3: c10d::TCPStore::doGet(std::string const&) + 0x32 (0x7f53696d6ce2 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default4]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 246, in _recv_meta [default0]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 187, in train_batch_iter [default0]: new_kwargs[name] = recv_from_pipeline_state_buffer( [default5]:frame #11: c10d::ProcessGroupNCCL::recv(std::vector >&, int, int) + 0x550 (0x7fefff97cb60 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default0]:frame #12: + 0x5838439 (0x7f91bfeed439 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default0]:frame #13: + 0x5843330 (0x7f91bfef8330 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default3]: return self.p2p.recv_tensors(num_tensors=1, from_rank=self.from_rank)[0] [default5]: File 
"/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 44, in forward [default7]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl [default0]: output = self.forward(context=context, state=state, micro_batch=micro_batch, model=model) [default4]: hidden_encoder_states = encoder_block(**hidden_encoder_states) [default4]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl [default0]:frame #25: _PyFunction_Vectorcall + 0x6c (0x563e23839a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 72, in wrapper [default7]:frame #7: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f8ebd525f81 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default3]:frame #33: _PyEval_EvalFrameDefault + 0x13ca (0x5610871608fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:frame #18: + 0x584ed35 (0x7f51b1c2bd35 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default2]:frame #19: + 0xc97eee (0x7f51c44ddeee in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_python.so) [default6]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 72, in wrapper [default0]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/functional.py", line 117, in recv_from_pipeline_state_buffer [default3]: output = self.forward(context=context, state=state, micro_batch=micro_batch, 
model=model) [default3]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 44, in forward [default0]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/functional.py", line 117, in recv_from_pipeline_state_buffer [default5]:frame #12: + 0x5838439 (0x7ff036707439 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default2]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 326, in irecv_tensors [default2]:frame #6: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f7c7915ff81 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default6]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 459, in training_step [default3]:frame #41: _PyFunction_Vectorcall + 0x6c (0x55a04ba9ca2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]: dist.recv( [default0]: pipeline_state.run_communication() [default5]:frame #13: + 0x5843330 (0x7ff036712330 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default5]:frame #14: + 0x58433c5 (0x7ff0367123c5 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default3]:frame #6: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f2b2e863f81 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default3]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 353, in recv_tensors [default0]:frame #6: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7fbff2a67f81 in 
/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default0]:frame #7: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7fbff2a67f81 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default4]: pipeline_state.run_communication() [default0]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 44, in forward [default4]: return self._call_impl(*args, **kwargs) [default5]:frame #15: + 0x4e893cc (0x7ff035d583cc in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default7]: outputs = self.pipeline_engine.train_batch_iter( [default5]: return func(*args, **kwargs) [default7]:frame #8: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f8ebd525f81 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default5]: output = model(**micro_batch) [default4]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 150, in run_communication [default0]: output = model(**micro_batch) [default4]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl [default5]:frame #16: + 0x1a08a88 (0x7ff0328d7a88 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default6]:frame #15: + 0x4e893cc (0x7f92eb1513cc in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default7]:frame #9: c10d::ProcessGroupNCCL::broadcastUniqueNCCLID(ncclUniqueId*, bool, std::string const&, int) + 0xa9 (0x7f8e86763c69 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default3]: buffers, futures 
= self.irecv_tensors(num_tensors=num_tensors, from_rank=from_rank, tag=tag) [default5]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl [default4]:frame #14: + 0x58433c5 (0x7f789b1c83c5 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default4]: recv_activation_tensor = recv_activation() [default2]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 31, in __call__ [default4]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 72, in wrapper [default5]: dist.recv( [default5]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 72, in wrapper [default2]:frame #24: _PyEval_EvalFrameDefault + 0x4c12 (0x564de4a54142 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:frame #16: + 0x1a08a88 (0x7f92e7cd0a88 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default3]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 326, in irecv_tensors [default6]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 786, in forward_with_hidden_states [default4]:frame #15: + 0x4e893cc (0x7f789a80e3cc in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default0]:frame #8: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7fbff2a67f81 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default1]:frame #41: _PyFunction_Vectorcall + 0x6c (0x5577f18e0a2c in 
/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:frame #4: c10d::TCPStore::get(std::string const&) + 0xa1 (0x7f53696d7b11 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default5]:frame #44: _PyEval_EvalFrameDefault + 0x13ca (0x55bb40d888fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 150, in run_communication [default2]:frame #25: _PyFunction_Vectorcall + 0x6c (0x564de4a5fa2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:frame #14: + 0x58433c5 (0x7f91bfef83c5 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default6]: fp32_sharded_logits = self.cast_to_fp32(x=sharded_logits)["output"] [default6]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl [default3]:frame #34: + 0x150582 (0x56108717b582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 31, in __call__ [default3]: output = model(**micro_batch) [default4]: return forward_call(*args, **kwargs) [default7]:Traceback (most recent call last): [default7]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 237, in [default0]:frame #26: PyObject_Call + 0xbc (0x563e23845f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:frame #35: _PyEval_EvalFrameDefault + 0x13ca (0x555988a9b8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:frame #7: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f7c7915ff81 in 
/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default5]: return self._call_impl(*args, **kwargs) [default2]: return self.p2p.recv_tensors(num_tensors=1, from_rank=self.from_rank)[0] [default0]: pipeline_state.run_communication() [default4]: return func(*args, **kwargs) [default4]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1706, in recv [default0]: recv_activation_tensor = recv_activation() [default5]: return func(*args, **kwargs) [default0]:frame #27: _PyEval_EvalFrameDefault + 0x2d83 (0x563e2382c2b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:frame #28: _PyFunction_Vectorcall + 0x6c (0x563e23839a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:Traceback (most recent call last): [default6]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 237, in [default5]:frame #17: + 0x5849a84 (0x7ff036718a84 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default5]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1706, in recv [default5]: pg.recv([tensor], group_src_rank, tag).wait() [default2]:frame #8: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f7c7915ff81 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default7]:frame #14: + 0x58433c5 (0x7fc806efb3c5 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default0]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 150, in run_communication [default5]:frame #45: + 0x150582 (0x55bb40da3582 in 
/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 31, in __call__ [default0]:frame #29: _PyEval_EvalFrameDefault + 0x13ca (0x563e2382a8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:frame #18: + 0x584ed35 (0x7fc8de94bd35 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default6]:frame #17: + 0x5849a84 (0x7f92ebb11a84 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default3]: meta = self._recv_meta(from_rank=from_rank, tag=tag) [default7]:frame #15: + 0x4e893cc (0x7fc8065413cc in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default3]:frame #35: _PyEval_EvalFrameDefault + 0x13ca (0x5610871608fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]: return forward_call(*args, **kwargs) [default0]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl [default5]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1706, in recv [default2]:frame #26: PyObject_Call + 0xbc (0x564de4a6bf1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:frame #27: _PyEval_EvalFrameDefault + 0x2d83 (0x564de4a522b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]: meta = self._recv_meta(from_rank=from_rank, tag=tag) [default6]: return self._call_impl(*args, **kwargs) [default3]:frame #36: _PyObject_FastCallDictTstate + 0xd0 (0x561087167f50 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]: return func(*args, **kwargs) 
[default6]: output = self.forward(context=context, state=state, micro_batch=micro_batch, model=model) [default0]: return self.p2p.recv_tensors(num_tensors=1, from_rank=self.from_rank)[0] [default6]: trainer.train(dataloader) [default2]:frame #28: _PyFunction_Vectorcall + 0x6c (0x564de4a5fa2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 246, in _recv_meta [default2]:frame #9: c10d::ProcessGroupNCCL::broadcastUniqueNCCLID(ncclUniqueId*, bool, std::string const&, int) + 0xa9 (0x7f7c4239dc69 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default3]:frame #37: _PyObject_Call_Prepend + 0x69 (0x561087179c39 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1706, in recv [default5]:frame #46: PyObject_Call + 0xbc (0x55bb40da3f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/block.py", line 126, in forward [default0]:frame #30: + 0x150582 (0x563e23845582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 187, in train_batch_iter [default5]:torch.distributed.DistBackendError: [7] is setting up NCCL communicator and retrieving ncclUniqueId from [0] via c10d key-value store by key '6:7', but store->get('6:7') got error: Connection reset by peer [default5]:Exception raised from recvBytes at ../torch/csrc/distributed/c10d/Utils.hpp:670 (most recent call first): [default7]:frame #10: 
c10d::ProcessGroupNCCL::getNCCLComm(std::string const&, std::vector > const&, c10d::OpType, int, bool) + 0x22b (0x7f8e8676ac5b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default6]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl [default6]: outputs = self.pipeline_engine.train_batch_iter( [default5]:frame #5: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f536968cf81 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default5]:frame #6: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f536968cf81 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default4]: pg.recv([tensor], group_src_rank, tag).wait() [default5]: pg.recv([tensor], group_src_rank, tag).wait() [default0]:frame #31: _PyEval_EvalFrameDefault + 0x13ca (0x563e2382a8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]: trainer.train(dataloader) [default0]:frame #15: + 0x4e893cc (0x7f91bf53e3cc in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default2]: dist.recv( [default6]: return forward_call(*args, **kwargs) [default6]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 187, in train_batch_iter [default5]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl [default7]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/block.py", line 126, in forward [default3]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl 
[default3]: return self._call_impl(*args, **kwargs) [default0]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 353, in recv_tensors [default2]:frame #29: _PyEval_EvalFrameDefault + 0x13ca (0x564de4a508fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:frame #24: _PyEval_EvalFrameDefault + 0x4c12 (0x55bf555f6142 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 246, in _recv_meta [default2]:frame #20: + 0x413ea4 (0x7f51c3c59ea4 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_python.so) [default3]:frame #38: + 0x211239 (0x56108723c239 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:frame #39: _PyObject_MakeTpCall + 0x26b (0x561087168a6b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]: return self.p2p.recv_tensors(num_tensors=1, from_rank=self.from_rank)[0] [default0]: return self._call_impl(*args, **kwargs) [default4]: new_kwargs[name] = recv_from_pipeline_state_buffer( [default6]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 430, in train [default6]: outputs, loss_avg = self.training_step(dataloader=self.current_dataloader) [default4]:frame #25: _PyFunction_Vectorcall + 0x6c (0x55bf55601a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/block.py", line 126, in forward [default6]: new_kwargs[name] = recv_from_pipeline_state_buffer( [default0]:frame #9: c10d::ProcessGroupNCCL::broadcastUniqueNCCLID(ncclUniqueId*, bool, std::string const&, int) + 0xa9 (0x7fbfbbca5c69 in 
/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default3]:frame #42: _PyEval_EvalFrameDefault + 0x72c (0x55a04ba8cc5c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl [default4]:torch.distributed.DistBackendError: [9] is setting up NCCL communicator and retrieving ncclUniqueId from [0] via c10d key-value store by key '8:9', but store->get('8:9') got error: Connection reset by peer [default0]: buffers, futures = self.irecv_tensors(num_tensors=num_tensors, from_rank=from_rank, tag=tag) [default6]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 459, in training_step [default5]:frame #18: + 0x584ed35 (0x7ff03671dd35 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default2]:frame #30: + 0x150582 (0x564de4a6b582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:frame #26: PyObject_Call + 0xbc (0x55bf5560df1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7ff4a73cdd87 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default7]: output = self.forward(context=context, state=state, micro_batch=micro_batch, model=model) [default1]:frame #36: _PyObject_FastCallDictTstate + 0xd0 (0x555988aa2f50 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:frame #7: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f2b2e863f81 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default3]: dist.recv( [default7]:frame #16: + 0x1a08a88 (0x7fc8030c0a88 in 
/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default2]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 353, in recv_tensors [default3]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl [default5]:torch.distributed.DistBackendError: [3] is setting up NCCL communicator and retrieving ncclUniqueId from [0] via c10d key-value store by key '2:3', but store->get('2:3') got error: Connection reset by peer [default4]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/functional.py", line 117, in recv_from_pipeline_state_buffer [default7]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 430, in train [default4]:frame #27: _PyEval_EvalFrameDefault + 0x2d83 (0x55bf555f42b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:frame #18: + 0x584ed35 (0x7f92ebb16d35 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default7]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 44, in forward [default3]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 72, in wrapper [default4]:frame #16: + 0x1a08a88 (0x7f789738da88 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default3]:frame #40: _PyEval_EvalFrameDefault + 0x4eb6 (0x5610871643e6 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:frame #41: _PyFunction_Vectorcall + 0x6c (0x56108716fa2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:frame #7: 
c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f536968cf81 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default3]: return forward_call(*args, **kwargs) [default6]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 44, in forward [default6]: output = model(**micro_batch) [default0]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 326, in irecv_tensors [default4]: pipeline_state.run_communication() [default3]:frame #19: + 0xc97eee (0x7fc8f11fdeee in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_python.so) [default0]:frame #16: + 0x1a08a88 (0x7f91bc0bda88 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default7]:frame #11: c10d::ProcessGroupNCCL::recv(std::vector >&, int, int) + 0x550 (0x7f8e8678db60 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default3]:frame #42: _PyEval_EvalFrameDefault + 0x72c (0x56108715fc5c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:frame #43: _PyFunction_Vectorcall + 0x6c (0x56108716fa2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]: pg.recv([tensor], group_src_rank, tag).wait() [default3]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 890, in forward [default4]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 150, in run_communication [default3]:frame #20: + 0x413ea4 (0x7fc8f0979ea4 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_python.so) [default4]:frame #28: _PyFunction_Vectorcall + 0x6c (0x55bf55601a2c in 
/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:frame #19: + 0xc97eee (0x7f92fe3c8eee in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_python.so) [default6]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/functional.py", line 117, in recv_from_pipeline_state_buffer [default6]: pipeline_state.run_communication() [default5]: return forward_call(*args, **kwargs) [default4]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 353, in recv_tensors [default4]: buffers, futures = self.irecv_tensors(num_tensors=num_tensors, from_rank=from_rank, tag=tag) [default4]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 326, in irecv_tensors [default0]: return forward_call(*args, **kwargs) [default5]:Exception raised from recvBytes at ../torch/csrc/distributed/c10d/Utils.hpp:670 (most recent call first): [default3]:frame #21: + 0x1445a6 (0x55e6ee8d65a6 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 72, in wrapper [default0]:frame #17: + 0x5849a84 (0x7f91bfefea84 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default7]:frame #12: + 0x5838439 (0x7f8ebd518439 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default5]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 890, in forward [default7]: new_kwargs[name] = recv_from_pipeline_state_buffer( [default4]:Exception raised from recvBytes at ../torch/csrc/distributed/c10d/Utils.hpp:670 (most recent call first): [default0]: meta = 
self._recv_meta(from_rank=from_rank, tag=tag) [default6]: outputs = self.pipeline_engine.train_batch_iter( [default1]:frame #37: _PyObject_Call_Prepend + 0x69 (0x555988ab4c39 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]: output = model(**micro_batch) [default3]:frame #8: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f2b2e863f81 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default6]:frame #20: + 0x413ea4 (0x7f92fdb44ea4 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_python.so) [default6]:frame #21: + 0x1445a6 (0x5603c81975a6 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:frame #29: _PyEval_EvalFrameDefault + 0x13ca (0x55bf555f28fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:frame #13: + 0x5843330 (0x7f8ebd523330 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default4]:frame #17: + 0x5849a84 (0x7f789b1cea84 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default0]:frame #10: c10d::ProcessGroupNCCL::getNCCLComm(std::string const&, std::vector > const&, c10d::OpType, int, bool) + 0x22b (0x7fbfbbcacc5b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default3]:frame #44: _PyEval_EvalFrameDefault + 0x13ca (0x5610871608fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:frame #45: + 0x150582 (0x56108717b582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:frame #43: _PyFunction_Vectorcall + 0x6c (0x55a04ba9ca2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]: sharded_logits = self.model( [default5]:frame #0: c10::Error::Error(c10::SourceLocation, 
std::string) + 0x57 (0x7fcbb356cd87 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default2]:frame #31: _PyEval_EvalFrameDefault + 0x13ca (0x564de4a508fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:frame #38: + 0x211239 (0x555988b77239 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 150, in run_communication [default5]: sharded_logits = self.model( [default2]: buffers, futures = self.irecv_tensors(num_tensors=num_tensors, from_rank=from_rank, tag=tag) [default6]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl [default0]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 890, in forward [default0]: sharded_logits = self.model( [default5]:frame #1: + 0x589518e (0x7fcbeb52618e in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default7]: outputs, loss_avg = self.training_step(dataloader=self.current_dataloader) [default7]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 459, in training_step [default2]: return func(*args, **kwargs) [default2]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1706, in recv [default6]: recv_activation_tensor = recv_activation() [default0]:frame #11: c10d::ProcessGroupNCCL::recv(std::vector >&, int, int) + 0x550 (0x7fbfbbccfb60 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default0]:frame #12: + 0x5838439 (0x7fbff2a5a439 in 
/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default5]:frame #8: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f536968cf81 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default6]: return self._call_impl(*args, **kwargs) [default0]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 246, in _recv_meta [default6]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 187, in train_batch_iter [default3]:frame #9: c10d::ProcessGroupNCCL::broadcastUniqueNCCLID(ncclUniqueId*, bool, std::string const&, int) + 0xa9 (0x7f2af7aa1c69 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default7]:frame #14: + 0x58433c5 (0x7f8ebd5233c5 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default3]:frame #46: PyObject_Call + 0xbc (0x56108717bf1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl [default4]: meta = self._recv_meta(from_rank=from_rank, tag=tag) [default4]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f6f3de2ed87 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default4]: recv_activation_tensor = recv_activation() [default2]:frame #32: + 0x150582 (0x564de4a6b582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:frame #19: + 0xc97eee (0x7ff048fcfeee in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_python.so) [default6]:frame #22: 
_PyObject_MakeTpCall + 0x26b (0x5603c8190a6b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:frame #15: + 0x4e893cc (0x7f8ebcb693cc in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default5]: return self._call_impl(*args, **kwargs) [default7]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/functional.py", line 117, in recv_from_pipeline_state_buffer [default0]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl [default5]:frame #2: c10d::TCPStore::doWait(c10::ArrayRef, std::chrono::duration >) + 0x360 (0x7fcbeb5209a0 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default7]: outputs = self.pipeline_engine.train_batch_iter( [default6]: output = self.forward(context=context, state=state, micro_batch=micro_batch, model=model) [default6]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 44, in forward [default4]:frame #30: + 0x150582 (0x55bf5560d582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 31, in __call__ [default6]: return self.p2p.recv_tensors(num_tensors=1, from_rank=self.from_rank)[0] [default3]:frame #47: _PyEval_EvalFrameDefault + 0x2d83 (0x5610871622b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:frame #48: + 0x150582 (0x56108717b582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:frame #44: _PyEval_EvalFrameDefault + 0x13ca (0x55a04ba8d8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:frame #47: _PyEval_EvalFrameDefault + 0x2d83 
(0x55bb40d8a2b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]: dist.recv( [default6]: output = model(**micro_batch) [default0]:frame #32: + 0x150582 (0x563e23845582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:frame #33: _PyEval_EvalFrameDefault + 0x13ca (0x563e2382a8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl [default7]:frame #16: + 0x1a08a88 (0x7f8eb96e8a88 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default2]:frame #10: c10d::ProcessGroupNCCL::getNCCLComm(std::string const&, std::vector > const&, c10d::OpType, int, bool) + 0x22b (0x7f7c423a4c5b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default2]:frame #21: + 0x1445a6 (0x55c6059fe5a6 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 326, in irecv_tensors [default4]:frame #1: + 0x589518e (0x7f6f75de818e in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default4]:frame #2: c10d::TCPStore::doWait(c10::ArrayRef, std::chrono::duration >) + 0x360 (0x7f6f75de29a0 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default4]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 31, in __call__ [default0]:frame #34: + 0x150582 (0x563e23845582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]: return self._call_impl(*args, **kwargs) [default6]: File 
"/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 353, in recv_tensors [default6]: buffers, futures = self.irecv_tensors(num_tensors=num_tensors, from_rank=from_rank, tag=tag) [default7]:frame #17: + 0x5849a84 (0x7fc806f01a84 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default0]:frame #13: + 0x5843330 (0x7fbff2a65330 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default0]:frame #14: + 0x58433c5 (0x7fbff2a653c5 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default5]:frame #9: c10d::ProcessGroupNCCL::broadcastUniqueNCCLID(ncclUniqueId*, bool, std::string const&, int) + 0xa9 (0x7f53328cac69 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default5]:frame #48: + 0x150582 (0x55bb40da3582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:frame #49: PyObject_Call + 0xbc (0x55bb40da3f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]: return self.p2p.recv_tensors(num_tensors=1, from_rank=self.from_rank)[0] [default7]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 187, in train_batch_iter [default2]: pg.recv([tensor], group_src_rank, tag).wait() [default2]:frame #11: c10d::ProcessGroupNCCL::recv(std::vector >&, int, int) + 0x550 (0x7f7c423c7b60 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default2]:frame #22: _PyObject_MakeTpCall + 0x26b (0x55c6059f7a6b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 246, in 
_recv_meta [default4]: dist.recv( [default0]: return self._call_impl(*args, **kwargs) [default4]:frame #3: c10d::TCPStore::doGet(std::string const&) + 0x32 (0x7f6f75de2ce2 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default4]:frame #4: c10d::TCPStore::get(std::string const&) + 0xa1 (0x7f6f75de3b11 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default0]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 72, in wrapper [default5]:frame #20: + 0x413ea4 (0x7ff04874bea4 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_python.so) [default6]:frame #23: + 0x150866 (0x5603c81a3866 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:frame #17: + 0x5849a84 (0x7f8ebd529a84 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default3]:frame #49: PyObject_Call + 0xbc (0x56108717bf1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:frame #50: _PyEval_EvalFrameDefault + 0x2d83 (0x5610871622b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:frame #45: + 0x150582 (0x55a04baa8582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]: recv_activation_tensor = recv_activation() [default3]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl [default5]:frame #3: c10d::TCPStore::doGet(std::string const&) + 0x32 (0x7fcbeb520ce2 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default4]: File 
"/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 353, in recv_tensors [default0]:frame #35: _PyEval_EvalFrameDefault + 0x13ca (0x563e2382a8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:frame #24: _PyEval_EvalFrameDefault + 0x4c12 (0x5603c818c142 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 326, in irecv_tensors [default0]:frame #15: + 0x4e893cc (0x7fbff20ab3cc in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default5]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl [default5]:frame #10: c10d::ProcessGroupNCCL::getNCCLComm(std::string const&, std::vector > const&, c10d::OpType, int, bool) + 0x22b (0x7f53328d1c5b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default3]: return self._call_impl(*args, **kwargs) [default6]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl [default5]:frame #4: c10d::TCPStore::get(std::string const&) + 0xa1 (0x7fcbeb521b11 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default0]: return func(*args, **kwargs) [default2]:frame #33: _PyEval_EvalFrameDefault + 0x13ca (0x564de4a508fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:frame #18: + 0x584ed35 (0x7f91bff03d35 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default0]:frame #19: + 0xc97eee (0x7f91d27b5eee in 
/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_python.so) [default6]: meta = self._recv_meta(from_rank=from_rank, tag=tag) [default5]: return forward_call(*args, **kwargs) [default5]:frame #11: c10d::ProcessGroupNCCL::recv(std::vector >&, int, int) + 0x550 (0x7f53328f4b60 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default4]:frame #5: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f6f75d98f81 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default5]:frame #50: _PyEval_EvalFrameDefault + 0x2d83 (0x55bb40d8a2b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1706, in recv [default2]:frame #34: + 0x150582 (0x564de4a6b582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:frame #20: + 0x413ea4 (0x7f91d1f31ea4 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_python.so) [default2]:torch.distributed.DistBackendError: [6] is setting up NCCL communicator and retrieving ncclUniqueId from [0] via c10d key-value store by key '5:6', but store->get('5:6') got error: Connection reset by peer [default7]:frame #18: + 0x584ed35 (0x7f8ebd52ed35 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default4]:frame #18: + 0x584ed35 (0x7f789b1d3d35 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default4]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 72, in wrapper [default4]:frame #6: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f6f75d98f81 
in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default5]:frame #5: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7fcbeb4d6f81 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default3]:frame #22: _PyObject_MakeTpCall + 0x26b (0x55e6ee8cfa6b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl [default7]:frame #19: + 0xc97eee (0x7f8ecfde0eee in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_python.so) [default2]:frame #23: + 0x150866 (0x55c605a0a866 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]: meta = self._recv_meta(from_rank=from_rank, tag=tag) [default6]: return forward_call(*args, **kwargs) [default6]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 890, in forward [default0]: pg.recv([tensor], group_src_rank, tag).wait() [default3]:frame #23: + 0x150866 (0x55e6ee8e2866 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:frame #39: _PyObject_MakeTpCall + 0x26b (0x555988aa3a6b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:frame #40: _PyEval_EvalFrameDefault + 0x4eb6 (0x555988a9f3e6 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 246, in _recv_meta [default7]:frame #18: + 0x584ed35 (0x7fc806f06d35 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default7]:frame #19: + 0xc97eee (0x7fc8197b8eee in 
/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_python.so) [default2]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 246, in _recv_meta [default7]: pipeline_state.run_communication() [default6]: sharded_logits = self.model( [default4]: buffers, futures = self.irecv_tensors(num_tensors=num_tensors, from_rank=from_rank, tag=tag) [default4]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 326, in irecv_tensors [default6]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl [default7]: output = self.forward(context=context, state=state, micro_batch=micro_batch, model=model) [default3]:frame #10: c10d::ProcessGroupNCCL::getNCCLComm(std::string const&, std::vector > const&, c10d::OpType, int, bool) + 0x22b (0x7f2af7aa8c5b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default2]:Exception raised from recvBytes at ../torch/csrc/distributed/c10d/Utils.hpp:670 (most recent call first): [default6]: dist.recv( [default7]:frame #20: + 0x413ea4 (0x7fc818f34ea4 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_python.so) [default3]:frame #46: PyObject_Call + 0xbc (0x55a04baa8f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:frame #47: _PyEval_EvalFrameDefault + 0x2d83 (0x55a04ba8f2b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl [default0]:torch.distributed.DistBackendError: [2] is setting up NCCL communicator and retrieving ncclUniqueId from [0] via c10d key-value store by 
key '1:2', but store->get('1:2') got error: Connection reset by peer [default5]:frame #21: + 0x1445a6 (0x55b2760cb5a6 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:frame #1: + 0x589518e (0x7ff4df38718e in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default5]:frame #2: c10d::TCPStore::doWait(c10::ArrayRef, std::chrono::duration >) + 0x360 (0x7ff4df3819a0 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default2]:frame #12: + 0x5838439 (0x7f7c79152439 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default3]: return func(*args, **kwargs) [default4]:frame #19: + 0xc97eee (0x7f78ada85eee in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_python.so) [default4]:frame #20: + 0x413ea4 (0x7f78ad201ea4 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_python.so) [default6]:torch.distributed.DistBackendError: [13] is setting up NCCL communicator and retrieving ncclUniqueId from [0] via c10d key-value store by key '12:13', but store->get('12:13') got error: Connection reset by peer [default0]: return forward_call(*args, **kwargs) [default5]:frame #6: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7fcbeb4d6f81 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default2]:frame #35: _PyEval_EvalFrameDefault + 0x13ca (0x564de4a508fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:frame #41: _PyFunction_Vectorcall + 0x6c (0x555988aaaa2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:frame #20: + 0x413ea4 (0x7f8ecf55cea4 in 
/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_python.so) [default7]:frame #21: + 0x1445a6 (0x55f2231e45a6 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:frame #16: + 0x1a08a88 (0x7fbfeec2aa88 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default7]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 150, in run_communication [default4]:frame #7: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f6f75d98f81 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default4]: meta = self._recv_meta(from_rank=from_rank, tag=tag) [default6]: return self._call_impl(*args, **kwargs) [default6]:frame #25: _PyFunction_Vectorcall + 0x6c (0x5603c8197a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7fbff6a73d87 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default2]:frame #1: + 0x589518e (0x7fc02ea2d18e in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default2]:frame #13: + 0x5843330 (0x7f7c7915d330 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default2]:frame #24: _PyEval_EvalFrameDefault + 0x4c12 (0x55c6059f3142 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 764, in forward [default4]: return func(*args, **kwargs) [default5]:frame #51: _PyFunction_Vectorcall + 0x6c (0x55bb40d97a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]: File 
"/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 246, in _recv_meta [default5]:frame #7: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7fcbeb4d6f81 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default0]:Exception raised from recvBytes at ../torch/csrc/distributed/c10d/Utils.hpp:670 (most recent call first): [default4]: dist.recv( [default5]:frame #22: _PyObject_MakeTpCall + 0x26b (0x55b2760c4a6b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:frame #3: c10d::TCPStore::doGet(std::string const&) + 0x32 (0x7ff4df381ce2 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default6]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 72, in wrapper [default5]: return self.forward_with_hidden_states(input_ids=input_ids, input_mask=input_mask)[0] [default4]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1706, in recv [default0]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 764, in forward [default0]: return self.forward_with_hidden_states(input_ids=input_ids, input_mask=input_mask)[0] [default0]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7faca5cfbd87 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default5]:frame #8: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7fcbeb4d6f81 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default2]:frame #36: _PyObject_FastCallDictTstate + 0xd0 (0x564de4a57f50 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:frame #42: 
_PyEval_EvalFrameDefault + 0x72c (0x555988a9ac5c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:frame #22: _PyObject_MakeTpCall + 0x26b (0x55f2231dda6b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:frame #21: + 0x1445a6 (0x5558053005a6 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]: dist.recv( [default6]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl [default0]:frame #1: + 0x589518e (0x7facddcb518e in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default0]:frame #36: _PyObject_FastCallDictTstate + 0xd0 (0x563e23831f50 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]: return forward_call(*args, **kwargs) [default7]:frame #23: + 0x150866 (0x55f2231f0866 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:frame #22: _PyObject_MakeTpCall + 0x26b (0x5558052f9a6b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 72, in wrapper [default4]:frame #8: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f6f75d98f81 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default4]:frame #9: c10d::ProcessGroupNCCL::broadcastUniqueNCCLID(ncclUniqueId*, bool, std::string const&, int) + 0xa9 (0x7f6f3efd6c69 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default5]:frame #9: c10d::ProcessGroupNCCL::broadcastUniqueNCCLID(ncclUniqueId*, bool, std::string const&, int) + 0xa9 (0x7fcbb4714c69 in 
/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default4]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 72, in wrapper [default6]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl [default0]:frame #21: + 0x1445a6 (0x55ee64f815a6 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:frame #22: _PyObject_MakeTpCall + 0x26b (0x55ee64f7aa6b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]: return func(*args, **kwargs) [default4]:frame #21: + 0x1445a6 (0x55be5ca835a6 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:frame #22: _PyObject_MakeTpCall + 0x26b (0x55be5ca7ca6b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]: recv_activation_tensor = recv_activation() [default0]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 780, in forward_with_hidden_states [default0]:frame #2: c10d::TCPStore::doWait(c10::ArrayRef, std::chrono::duration >) + 0x360 (0x7facddcaf9a0 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default3]:frame #24: _PyEval_EvalFrameDefault + 0x4c12 (0x55e6ee8cb142 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:frame #4: c10d::TCPStore::get(std::string const&) + 0xa1 (0x7ff4df382b11 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default6]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1706, in recv [default3]: File 
"/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1706, in recv [default2]:frame #14: + 0x58433c5 (0x7f7c7915d3c5 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default3]:frame #51: _PyFunction_Vectorcall + 0x6c (0x56108716fa2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:frame #25: _PyFunction_Vectorcall + 0x6c (0x55c6059fea2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]: pg.recv([tensor], group_src_rank, tag).wait() [default5]:frame #52: _PyObject_FastCallDictTstate + 0x187 (0x55bb40d90007 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]: return func(*args, **kwargs) [default5]:frame #23: + 0x150866 (0x55b2760d7866 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:frame #11: c10d::ProcessGroupNCCL::recv(std::vector >&, int, int) + 0x550 (0x7f2af7acbb60 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default2]:frame #2: c10d::TCPStore::doWait(c10::ArrayRef, std::chrono::duration >) + 0x360 (0x7fc02ea279a0 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default6]:frame #26: PyObject_Call + 0xbc (0x5603c81a3f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:frame #43: _PyFunction_Vectorcall + 0x6c (0x555988aaaa2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]: pg.recv([tensor], group_src_rank, tag).wait() [default2]:frame #26: PyObject_Call + 0xbc (0x55c605a0af1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:torch.distributed.DistBackendError: [13] is setting up NCCL communicator and retrieving ncclUniqueId from [0] via c10d key-value store by key '12:13', but 
store->get('12:13') got error: Connection reset by peer [default5]:frame #12: + 0x5838439 (0x7f536967f439 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default6]: return self._call_impl(*args, **kwargs) [default5]:frame #10: c10d::ProcessGroupNCCL::getNCCLComm(std::string const&, std::vector > const&, c10d::OpType, int, bool) + 0x22b (0x7fcbb471bc5b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default5]:frame #24: _PyEval_EvalFrameDefault + 0x4c12 (0x55b2760c0142 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]: return forward_call(*args, **kwargs) [default6]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 890, in forward [default5]:frame #5: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7ff4df337f81 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default5]:frame #6: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7ff4df337f81 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default2]:frame #15: + 0x4e893cc (0x7f7c787a33cc in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default4]:frame #23: + 0x150866 (0x55be5ca8f866 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:frame #13: + 0x5843330 (0x7f536968a330 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default3]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl [default5]:frame #11: c10d::ProcessGroupNCCL::recv(std::vector >&, int, int) + 0x550 (0x7fcbb473eb60 in 
/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default2]:frame #37: _PyObject_Call_Prepend + 0x69 (0x564de4a69c39 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:frame #7: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7ff4df337f81 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default4]:frame #31: _PyEval_EvalFrameDefault + 0x13ca (0x55bf555f28fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:frame #32: + 0x150582 (0x55bf5560d582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:frame #16: + 0x1a08a88 (0x7f7c75322a88 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default7]:frame #23: + 0x150866 (0x55580530c866 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:frame #24: _PyEval_EvalFrameDefault + 0x4c12 (0x5558052f5142 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:frame #48: + 0x150582 (0x55a04baa8582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:frame #10: c10d::ProcessGroupNCCL::getNCCLComm(std::string const&, std::vector > const&, c10d::OpType, int, bool) + 0x22b (0x7f6f3efddc5b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default4]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1706, in recv [default4]: pg.recv([tensor], group_src_rank, tag).wait() [default5]:frame #25: _PyFunction_Vectorcall + 0x6c (0x55b2760cba2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:frame #37: _PyObject_Call_Prepend + 0x69 (0x563e23843c39 in 
/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:frame #25: _PyFunction_Vectorcall + 0x6c (0x55e6ee8d6a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:frame #26: PyObject_Call + 0xbc (0x55e6ee8e2f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 890, in forward [default7]:frame #24: _PyEval_EvalFrameDefault + 0x4c12 (0x55f2231d9142 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:frame #25: _PyFunction_Vectorcall + 0x6c (0x555805300a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:frame #27: _PyEval_EvalFrameDefault + 0x2d83 (0x55c6059f12b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:frame #28: _PyFunction_Vectorcall + 0x6c (0x55c6059fea2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 31, in __call__ [default6]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl [default0]:frame #3: c10d::TCPStore::doGet(std::string const&) + 0x32 (0x7facddcafce2 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default6]: sharded_logits = self.model( [default6]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl [default5]:frame #8: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7ff4df337f81 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default7]:frame #25: _PyFunction_Vectorcall + 0x6c (0x55f2231e4a2c in 
/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:frame #17: + 0x5849a84 (0x7f7c79163a84 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default0]:frame #17: + 0x5849a84 (0x7fbff2a6ba84 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default1]:frame #42: _PyEval_EvalFrameDefault + 0x72c (0x5577f18d0c5c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]: return forward_call(*args, **kwargs) [default5]:frame #12: + 0x5838439 (0x7fcbeb4c9439 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default5]:frame #26: PyObject_Call + 0xbc (0x55b2760d7f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:frame #38: + 0x211239 (0x563e23906239 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:frame #44: _PyEval_EvalFrameDefault + 0x13ca (0x555988a9b8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]: pg.recv([tensor], group_src_rank, tag).wait() [default7]:frame #26: PyObject_Call + 0xbc (0x55580530cf1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:frame #52: _PyObject_FastCallDictTstate + 0x187 (0x561087168007 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:frame #43: _PyFunction_Vectorcall + 0x6c (0x5577f18e0a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:frame #14: + 0x58433c5 (0x7f536968a3c5 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default6]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 764, in forward [default4]:frame #11: c10d::ProcessGroupNCCL::recv(std::vector >&, int, int) + 0x550 (0x7f6f3f000b60 in 
/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default0]:frame #4: c10d::TCPStore::get(std::string const&) + 0xa1 (0x7facddcb0b11 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default4]:torch.distributed.DistBackendError: [3] is setting up NCCL communicator and retrieving ncclUniqueId from [0] via c10d key-value store by key '2:3', but store->get('2:3') got error: Connection reset by peer [default0]:frame #39: _PyObject_MakeTpCall + 0x26b (0x563e23832a6b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 44, in forward [default4]:frame #33: _PyEval_EvalFrameDefault + 0x13ca (0x55bf555f28fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:frame #26: PyObject_Call + 0xbc (0x55f2231f0f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:frame #27: _PyEval_EvalFrameDefault + 0x2d83 (0x55f2231d72b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:frame #53: _PyObject_Call_Prepend + 0x69 (0x561087179c39 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:frame #29: _PyEval_EvalFrameDefault + 0x13ca (0x55c6059ef8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:Exception raised from recvBytes at ../torch/csrc/distributed/c10d/Utils.hpp:670 (most recent call first): [default0]: return self.p2p.recv_tensors(num_tensors=1, from_rank=self.from_rank)[0] [default7]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 31, in __call__ [default0]: hidden_encoder_states = encoder_block(**hidden_encoder_states) [default0]: File 
"/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl [default0]:frame #5: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7facddc65f81 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default0]:frame #40: _PyEval_EvalFrameDefault + 0x4eb6 (0x563e2382e3e6 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:frame #34: + 0x150582 (0x55bf5560d582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]: sharded_logits = self.model( [default0]:frame #23: + 0x150866 (0x55ee64f8d866 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:frame #24: _PyEval_EvalFrameDefault + 0x4c12 (0x55ee64f76142 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:torch.distributed.DistBackendError: [14] is setting up NCCL communicator and retrieving ncclUniqueId from [0] via c10d key-value store by key '13:14', but store->get('13:14') got error: Connection reset by peer [default0]:frame #18: + 0x584ed35 (0x7fbff2a70d35 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default0]:frame #19: + 0xc97eee (0x7fc005322eee in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_python.so) [default3]:frame #49: PyObject_Call + 0xbc (0x55a04baa8f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:frame #50: _PyEval_EvalFrameDefault + 0x2d83 (0x55a04ba8f2b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:frame #12: + 0x5838439 (0x7f6f75d8b439 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default4]:Exception raised from recvBytes at ../torch/csrc/distributed/c10d/Utils.hpp:670 (most recent call 
first): [default3]:frame #27: _PyEval_EvalFrameDefault + 0x2d83 (0x55e6ee8c92b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]: return self._call_impl(*args, **kwargs) [default7]: output = model(**micro_batch) [default1]:frame #45: + 0x150582 (0x555988ab6582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:frame #25: _PyFunction_Vectorcall + 0x6c (0x55ee64f81a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:frame #18: + 0x584ed35 (0x7f7c79168d35 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default3]:frame #54: + 0x211239 (0x56108723c239 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]: return func(*args, **kwargs) [default3]: return forward_call(*args, **kwargs) [default0]:frame #6: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7facddc65f81 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default0]:frame #41: _PyFunction_Vectorcall + 0x6c (0x563e23839a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:frame #28: _PyFunction_Vectorcall + 0x6c (0x55e6ee8d6a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:frame #26: PyObject_Call + 0xbc (0x55ee64f8df1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:frame #19: + 0xc97eee (0x7f7c8ba1aeee in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_python.so) [default4]:Traceback (most recent call last): [default4]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 237, in [default6]:torch.distributed.DistBackendError: [15] is setting up NCCL communicator and retrieving ncclUniqueId from [0] via c10d key-value store by key '14:15', but store->get('14:15') got error: Connection 
reset by peer [default7]:frame #27: _PyEval_EvalFrameDefault + 0x2d83 (0x5558052f32b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:frame #28: _PyFunction_Vectorcall + 0x6c (0x555805300a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:Exception raised from recvBytes at ../torch/csrc/distributed/c10d/Utils.hpp:670 (most recent call first): [default6]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7fc3ccc93d87 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default7]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl [default2]:frame #3: c10d::TCPStore::doGet(std::string const&) + 0x32 (0x7fc02ea27ce2 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default0]:frame #42: _PyEval_EvalFrameDefault + 0x72c (0x563e23829c5c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:frame #29: _PyEval_EvalFrameDefault + 0x13ca (0x55e6ee8c78fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:frame #46: PyObject_Call + 0xbc (0x555988ab6f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]: return self._call_impl(*args, **kwargs) [default0]:frame #27: _PyEval_EvalFrameDefault + 0x2d83 (0x55ee64f742b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:frame #43: _PyFunction_Vectorcall + 0x6c (0x563e23839a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:frame #44: _PyEval_EvalFrameDefault + 0x13ca (0x563e2382a8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:frame #27: _PyEval_EvalFrameDefault + 0x2d83 (0x5603c818a2b3 in 
/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:frame #27: _PyEval_EvalFrameDefault + 0x2d83 (0x55b2760be2b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:frame #9: c10d::ProcessGroupNCCL::broadcastUniqueNCCLID(ncclUniqueId*, bool, std::string const&, int) + 0xa9 (0x7ff4a8575c69 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default5]:frame #28: _PyFunction_Vectorcall + 0x6c (0x55b2760cba2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:frame #29: _PyEval_EvalFrameDefault + 0x13ca (0x55b2760bc8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:frame #12: + 0x5838439 (0x7f2b2e856439 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default6]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl [default3]:frame #13: + 0x5843330 (0x7f2b2e861330 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default2]:frame #38: + 0x211239 (0x564de4b2c239 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:frame #35: _PyEval_EvalFrameDefault + 0x13ca (0x55bf555f28fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:frame #36: _PyObject_FastCallDictTstate + 0xd0 (0x55bf555f9f50 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:frame #30: + 0x150582 (0x55e6ee8e2582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:frame #10: c10d::ProcessGroupNCCL::getNCCLComm(std::string const&, std::vector > const&, c10d::OpType, int, bool) + 0x22b (0x7ff4a857cc5b in 
/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default6]: return forward_call(*args, **kwargs) [default6]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 764, in forward [default3]:frame #14: + 0x58433c5 (0x7f2b2e8613c5 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default6]: return self.forward_with_hidden_states(input_ids=input_ids, input_mask=input_mask)[0] [default2]:frame #39: _PyObject_MakeTpCall + 0x26b (0x564de4a58a6b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:frame #40: _PyEval_EvalFrameDefault + 0x4eb6 (0x564de4a543e6 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl [default2]:frame #41: _PyFunction_Vectorcall + 0x6c (0x564de4a5fa2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:frame #37: _PyObject_Call_Prepend + 0x69 (0x55bf5560bc39 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 780, in forward_with_hidden_states [default5]:frame #11: c10d::ProcessGroupNCCL::recv(std::vector >&, int, int) + 0x550 (0x7ff4a859fb60 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default6]: hidden_encoder_states = encoder_block(**hidden_encoder_states) [default5]:frame #30: + 0x150582 (0x55b2760d7582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:frame #12: + 0x5838439 (0x7ff4df32a439 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default1]:frame #47: 
_PyEval_EvalFrameDefault + 0x2d83 (0x555988a9d2b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:frame #48: + 0x150582 (0x555988ab6582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:frame #31: _PyEval_EvalFrameDefault + 0x13ca (0x55b2760bc8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:frame #4: c10d::TCPStore::get(std::string const&) + 0xa1 (0x7fc02ea28b11 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default4]:frame #38: + 0x211239 (0x55bf556ce239 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:frame #39: _PyObject_MakeTpCall + 0x26b (0x55bf555faa6b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl [default6]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl [default0]:frame #28: _PyFunction_Vectorcall + 0x6c (0x55ee64f81a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:frame #40: _PyEval_EvalFrameDefault + 0x4eb6 (0x55bf555f63e6 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:frame #42: _PyEval_EvalFrameDefault + 0x72c (0x564de4a4fc5c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:frame #43: _PyFunction_Vectorcall + 0x6c (0x564de4a5fa2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:frame #5: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7fc02e9ddf81 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default2]:frame #6: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7fc02e9ddf81 
in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default0]:frame #45: + 0x150582 (0x563e23845582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:frame #46: PyObject_Call + 0xbc (0x563e23845f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]: return self._call_impl(*args, **kwargs) [default2]:frame #7: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7fc02e9ddf81 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default3]:frame #31: _PyEval_EvalFrameDefault + 0x13ca (0x55e6ee8c78fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:frame #8: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7fc02e9ddf81 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default4]:frame #41: _PyFunction_Vectorcall + 0x6c (0x55bf55601a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:frame #32: + 0x150582 (0x55e6ee8e2582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]: return self._call_impl(*args, **kwargs) [default0]:frame #29: _PyEval_EvalFrameDefault + 0x13ca (0x55ee64f728fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:frame #33: _PyEval_EvalFrameDefault + 0x13ca (0x55e6ee8c78fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:frame #34: + 0x150582 (0x55e6ee8e2582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:frame #15: + 0x4e893cc (0x7f2b2dea73cc in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default5]:frame #32: + 0x150582 (0x55b2760d7582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:frame #28: _PyFunction_Vectorcall + 
0x6c (0x5603c8197a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:frame #47: _PyEval_EvalFrameDefault + 0x2d83 (0x563e2382c2b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:frame #49: PyObject_Call + 0xbc (0x555988ab6f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:frame #48: + 0x150582 (0x563e23845582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:frame #13: + 0x5843330 (0x7ff4df335330 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default3]:frame #16: + 0x1a08a88 (0x7f2b2aa26a88 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default2]:frame #9: c10d::ProcessGroupNCCL::broadcastUniqueNCCLID(ncclUniqueId*, bool, std::string const&, int) + 0xa9 (0x7fbff7c1bc69 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default6]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl [default1]:frame #50: _PyEval_EvalFrameDefault + 0x2d83 (0x555988a9d2b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]: return forward_call(*args, **kwargs) [default5]:frame #33: _PyEval_EvalFrameDefault + 0x13ca (0x55b2760bc8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/block.py", line 126, in forward [default7]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl [default6]: new_kwargs[name] = recv_from_pipeline_state_buffer( [default3]:frame #17: + 0x5849a84 (0x7f2b2e867a84 in 
/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default0]:frame #49: PyObject_Call + 0xbc (0x563e23845f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:frame #44: _PyEval_EvalFrameDefault + 0x13ca (0x564de4a508fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:frame #10: c10d::ProcessGroupNCCL::getNCCLComm(std::string const&, std::vector > const&, c10d::OpType, int, bool) + 0x22b (0x7fbff7c22c5b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default6]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/functional.py", line 117, in recv_from_pipeline_state_buffer [default2]:frame #11: c10d::ProcessGroupNCCL::recv(std::vector >&, int, int) + 0x550 (0x7fbff7c45b60 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default0]:frame #50: _PyEval_EvalFrameDefault + 0x2d83 (0x563e2382c2b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:frame #35: _PyEval_EvalFrameDefault + 0x13ca (0x55e6ee8c78fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:frame #36: _PyObject_FastCallDictTstate + 0xd0 (0x55e6ee8cef50 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:frame #12: + 0x5838439 (0x7fc02e9d0439 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default6]: pipeline_state.run_communication() [default4]:frame #42: _PyEval_EvalFrameDefault + 0x72c (0x55bf555f1c5c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:frame #43: _PyFunction_Vectorcall + 0x6c (0x55bf55601a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:frame #51: _PyFunction_Vectorcall + 0x6c 
(0x563e23839a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:frame #44: _PyEval_EvalFrameDefault + 0x13ca (0x55bf555f28fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]: return forward_call(*args, **kwargs) [default0]:frame #30: + 0x150582 (0x55ee64f8d582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:frame #31: _PyEval_EvalFrameDefault + 0x13ca (0x55ee64f728fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:frame #45: + 0x150582 (0x564de4a6b582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:frame #51: _PyFunction_Vectorcall + 0x6c (0x555988aaaa2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:frame #46: PyObject_Call + 0xbc (0x564de4a6bf1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 150, in run_communication [default1]:frame #52: _PyObject_FastCallDictTstate + 0x187 (0x555988aa3007 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:frame #37: _PyObject_Call_Prepend + 0x69 (0x55e6ee8e0c39 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:frame #38: + 0x211239 (0x55e6ee9a3239 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:frame #13: + 0x5843330 (0x7fc02e9db330 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default1]:frame #53: _PyObject_Call_Prepend + 0x69 (0x555988ab4c39 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]: return forward_call(*args, **kwargs) [default6]: recv_activation_tensor = recv_activation() [default0]:frame #32: + 0x150582 (0x55ee64f8d582 in 
/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:frame #47: _PyEval_EvalFrameDefault + 0x2d83 (0x564de4a522b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:frame #52: _PyObject_FastCallDictTstate + 0x187 (0x563e23832007 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 764, in forward [default6]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 31, in __call__ [default5]:frame #14: + 0x58433c5 (0x7ff4df3353c5 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default6]:frame #29: _PyEval_EvalFrameDefault + 0x13ca (0x5603c81888fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:frame #33: _PyEval_EvalFrameDefault + 0x13ca (0x55ee64f728fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:frame #48: + 0x150582 (0x564de4a6b582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:frame #34: + 0x150582 (0x55ee64f8d582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:frame #39: _PyObject_MakeTpCall + 0x26b (0x55e6ee8cfa6b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:frame #35: _PyEval_EvalFrameDefault + 0x13ca (0x55ee64f728fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]: return self.forward_with_hidden_states(input_ids=input_ids, input_mask=input_mask)[0] [default1]:frame #54: + 0x211239 (0x555988b77239 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:frame #55: PyObject_Call + 0x207 (0x555988ab7067 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]: File 
"/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 890, in forward [default3]:frame #18: + 0x584ed35 (0x7f2b2e86cd35 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default5]:frame #34: + 0x150582 (0x55b2760d7582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:frame #15: + 0x4e893cc (0x7ff4de97b3cc in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default5]:frame #16: + 0x1a08a88 (0x7ff4db4faa88 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default5]:frame #35: _PyEval_EvalFrameDefault + 0x13ca (0x55b2760bc8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:frame #36: _PyObject_FastCallDictTstate + 0xd0 (0x55ee64f79f50 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:frame #53: _PyObject_Call_Prepend + 0x69 (0x563e23843c39 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:frame #54: + 0x211239 (0x563e23906239 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:frame #30: + 0x150582 (0x5603c81a3582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:frame #56: _PyEval_EvalFrameDefault + 0x2d83 (0x555988a9d2b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]: return self.p2p.recv_tensors(num_tensors=1, from_rank=self.from_rank)[0] [default7]: sharded_logits = self.model( [default5]:frame #36: _PyObject_FastCallDictTstate + 0xd0 (0x55b2760c3f50 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:frame #19: + 0xc97eee (0x7f2b4111eeee in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_python.so) [default3]:frame #20: + 0x413ea4 
(0x7f2b4089aea4 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_python.so) [default3]:frame #40: _PyEval_EvalFrameDefault + 0x4eb6 (0x55e6ee8cb3e6 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:frame #45: + 0x150582 (0x55bf5560d582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:frame #46: PyObject_Call + 0xbc (0x55bf5560df1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:frame #57: + 0x150582 (0x555988ab6582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:frame #41: _PyFunction_Vectorcall + 0x6c (0x55e6ee8d6a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 353, in recv_tensors [default0]:frame #37: _PyObject_Call_Prepend + 0x69 (0x55ee64f8bc39 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]: buffers, futures = self.irecv_tensors(num_tensors=num_tensors, from_rank=from_rank, tag=tag) [default6]:frame #31: _PyEval_EvalFrameDefault + 0x13ca (0x5603c81888fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:frame #47: _PyEval_EvalFrameDefault + 0x2d83 (0x55bf555f42b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:frame #49: PyObject_Call + 0xbc (0x564de4a6bf1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:frame #48: + 0x150582 (0x55bf5560d582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:frame #21: + 0x1445a6 (0x55fbc6ecf5a6 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 780, in forward_with_hidden_states [default0]:frame #55: 
PyObject_Call + 0x207 (0x563e23846067 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:frame #32: + 0x150582 (0x5603c81a3582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:frame #33: _PyEval_EvalFrameDefault + 0x13ca (0x5603c81888fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 326, in irecv_tensors [default6]:frame #34: + 0x150582 (0x5603c81a3582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:frame #38: + 0x211239 (0x55ee6504e239 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:frame #50: _PyEval_EvalFrameDefault + 0x2d83 (0x564de4a522b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:frame #37: _PyObject_Call_Prepend + 0x69 (0x55b2760d5c39 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:frame #14: + 0x58433c5 (0x7fc02e9db3c5 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default2]:frame #15: + 0x4e893cc (0x7fc02e0213cc in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default7]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl [default2]:frame #16: + 0x1a08a88 (0x7fc02aba0a88 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default7]: return self._call_impl(*args, **kwargs) [default6]:frame #35: _PyEval_EvalFrameDefault + 0x13ca (0x5603c81888fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:frame #56: _PyEval_EvalFrameDefault + 0x2d83 (0x563e2382c2b3 in 
/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:frame #39: _PyObject_MakeTpCall + 0x26b (0x55ee64f7aa6b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:frame #40: _PyEval_EvalFrameDefault + 0x4eb6 (0x55ee64f763e6 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:frame #57: + 0x150582 (0x563e23845582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:frame #41: _PyFunction_Vectorcall + 0x6c (0x55ee64f81a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:frame #17: + 0x5849a84 (0x7ff4df33ba84 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default6]: meta = self._recv_meta(from_rank=from_rank, tag=tag) [default6]:frame #36: _PyObject_FastCallDictTstate + 0xd0 (0x5603c818ff50 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:frame #42: _PyEval_EvalFrameDefault + 0x72c (0x55e6ee8c6c5c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:frame #37: _PyObject_Call_Prepend + 0x69 (0x5603c81a1c39 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl [default6]:frame #38: + 0x211239 (0x5603c8264239 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]: hidden_encoder_states = encoder_block(**hidden_encoder_states) [default6]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 246, in _recv_meta [default7]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl [default0]:frame #58: _PyEval_EvalFrameDefault + 0x13ca (0x563e2382a8fa in 
/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:frame #59: + 0x150582 (0x563e23845582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:frame #17: + 0x5849a84 (0x7fc02e9e1a84 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default7]: return forward_call(*args, **kwargs) [default0]:frame #60: PyObject_Call + 0xbc (0x563e23845f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:frame #49: PyObject_Call + 0xbc (0x55bf5560df1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:frame #50: _PyEval_EvalFrameDefault + 0x2d83 (0x55bf555f42b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:frame #38: + 0x211239 (0x55b276198239 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:frame #42: _PyEval_EvalFrameDefault + 0x72c (0x55ee64f71c5c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:frame #22: _PyObject_MakeTpCall + 0x26b (0x55fbc6ec8a6b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:frame #51: _PyFunction_Vectorcall + 0x6c (0x564de4a5fa2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:frame #23: + 0x150866 (0x55fbc6edb866 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:frame #43: _PyFunction_Vectorcall + 0x6c (0x55e6ee8d6a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:frame #39: _PyObject_MakeTpCall + 0x26b (0x5603c8190a6b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:frame #61: _PyEval_EvalFrameDefault + 0x2d83 (0x563e2382c2b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]: return self._call_impl(*args, **kwargs) [default2]:frame #18: + 0x584ed35 (0x7fc02e9e6d35 in 
/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default2]:frame #19: + 0xc97eee (0x7fc041298eee in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_python.so) [default5]:frame #39: _PyObject_MakeTpCall + 0x26b (0x55b2760c4a6b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:frame #51: _PyFunction_Vectorcall + 0x6c (0x55bf55601a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:frame #52: _PyObject_FastCallDictTstate + 0x187 (0x564de4a58007 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:frame #52: _PyObject_FastCallDictTstate + 0x187 (0x55bf555fa007 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:frame #58: _PyEval_EvalFrameDefault + 0x13ca (0x555988a9b8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:frame #44: _PyEval_EvalFrameDefault + 0x13ca (0x55e6ee8c78fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl [default0]:frame #62: + 0x150582 (0x563e23845582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 764, in forward [default2]:frame #20: + 0x413ea4 (0x7fc040a14ea4 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_python.so) [default6]: dist.recv( [default6]:frame #40: _PyEval_EvalFrameDefault + 0x4eb6 (0x5603c818c3e6 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 72, in 
wrapper [default5]:frame #40: _PyEval_EvalFrameDefault + 0x4eb6 (0x55b2760c03e6 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:frame #41: _PyFunction_Vectorcall + 0x6c (0x55b2760cba2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:frame #24: _PyEval_EvalFrameDefault + 0x4c12 (0x55fbc6ec4142 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:frame #25: _PyFunction_Vectorcall + 0x6c (0x55fbc6ecfa2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:frame #53: _PyObject_Call_Prepend + 0x69 (0x564de4a69c39 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:frame #54: + 0x211239 (0x564de4b2c239 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]: return forward_call(*args, **kwargs) [default3]:frame #45: + 0x150582 (0x55e6ee8e2582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:frame #59: + 0x150582 (0x555988ab6582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:frame #63: PyObject_Call + 0xbc (0x563e23845f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:frame #43: _PyFunction_Vectorcall + 0x6c (0x55ee64f81a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:frame #42: _PyEval_EvalFrameDefault + 0x72c (0x55b2760bbc5c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:frame #26: PyObject_Call + 0xbc (0x55fbc6edbf1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:frame #43: _PyFunction_Vectorcall + 0x6c (0x55b2760cba2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:frame #41: _PyFunction_Vectorcall + 0x6c (0x5603c8197a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:frame #42: _PyEval_EvalFrameDefault + 0x72c 
(0x5603c8187c5c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]: return self.forward_with_hidden_states(input_ids=input_ids, input_mask=input_mask)[0] [default6]:frame #43: _PyFunction_Vectorcall + 0x6c (0x5603c8197a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:. This may indicate a possible application crash on rank 0 or a network set up issue. [default4]:frame #53: _PyObject_Call_Prepend + 0x69 (0x55bf5560bc39 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:frame #46: PyObject_Call + 0xbc (0x55e6ee8e2f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:frame #18: + 0x584ed35 (0x7ff4df340d35 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default2]:frame #55: PyObject_Call + 0x207 (0x564de4a6c067 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:frame #19: + 0xc97eee (0x7ff4f1bf2eee in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_python.so) [default6]:frame #44: _PyEval_EvalFrameDefault + 0x13ca (0x5603c81888fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:frame #56: _PyEval_EvalFrameDefault + 0x2d83 (0x564de4a522b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/block.py", line 126, in forward [default5]:frame #44: _PyEval_EvalFrameDefault + 0x13ca (0x55b2760bc8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:frame #54: + 0x211239 (0x55bf556ce239 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:frame #45: + 0x150582 (0x55b2760d7582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]: File 
"/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 780, in forward_with_hidden_states [default4]:frame #55: PyObject_Call + 0x207 (0x55bf5560e067 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]: hidden_encoder_states = encoder_block(**hidden_encoder_states) [default3]:frame #27: _PyEval_EvalFrameDefault + 0x2d83 (0x55fbc6ec22b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:frame #46: PyObject_Call + 0xbc (0x55b2760d7f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:frame #45: + 0x150582 (0x5603c81a3582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:frame #47: _PyEval_EvalFrameDefault + 0x2d83 (0x55e6ee8c92b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:frame #21: + 0x1445a6 (0x55635793e5a6 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:frame #48: + 0x150582 (0x55e6ee8e2582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl [default6]: return func(*args, **kwargs) [default2]:frame #22: _PyObject_MakeTpCall + 0x26b (0x556357937a6b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:frame #49: PyObject_Call + 0xbc (0x55e6ee8e2f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:frame #50: _PyEval_EvalFrameDefault + 0x2d83 (0x55e6ee8c92b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:frame #56: _PyEval_EvalFrameDefault + 0x2d83 (0x55bf555f42b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:frame #51: _PyFunction_Vectorcall + 0x6c (0x55e6ee8d6a2c in 
/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1706, in recv [default7]: new_kwargs[name] = recv_from_pipeline_state_buffer( [default7]: return self._call_impl(*args, **kwargs) [default7]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl [default7]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/functional.py", line 117, in recv_from_pipeline_state_buffer [default7]: return forward_call(*args, **kwargs) [default1]:frame #60: PyObject_Call + 0xbc (0x555988ab6f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:frame #47: _PyEval_EvalFrameDefault + 0x2d83 (0x55b2760be2b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:frame #23: + 0x150866 (0x55635794a866 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:frame #48: + 0x150582 (0x55b2760d7582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:frame #20: + 0x413ea4 (0x7ff4f136eea4 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_python.so) [default3]:frame #52: _PyObject_FastCallDictTstate + 0x187 (0x55e6ee8cf007 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:frame #28: _PyFunction_Vectorcall + 0x6c (0x55fbc6ecfa2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/block.py", line 126, in forward [default3]:frame #29: _PyEval_EvalFrameDefault + 0x13ca (0x55fbc6ec08fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:frame #49: 
PyObject_Call + 0xbc (0x55b2760d7f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:frame #24: _PyEval_EvalFrameDefault + 0x4c12 (0x556357933142 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]: pg.recv([tensor], group_src_rank, tag).wait() [default0]:frame #44: _PyEval_EvalFrameDefault + 0x13ca (0x55ee64f728fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:torch.distributed.DistBackendError: [11] is setting up NCCL communicator and retrieving ncclUniqueId from [0] via c10d key-value store by key '10:11', but store->get('10:11') got error: Connection reset by peer [default3]:frame #53: _PyObject_Call_Prepend + 0x69 (0x55e6ee8e0c39 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:frame #50: _PyEval_EvalFrameDefault + 0x2d83 (0x55b2760be2b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:frame #51: _PyFunction_Vectorcall + 0x6c (0x55b2760cba2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]: pipeline_state.run_communication() [default6]:Exception raised from recvBytes at ../torch/csrc/distributed/c10d/Utils.hpp:670 (most recent call first): [default1]:frame #61: _PyEval_EvalFrameDefault + 0x2d83 (0x555988a9d2b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:frame #62: + 0x150582 (0x555988ab6582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:frame #54: + 0x211239 (0x55e6ee9a3239 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:frame #21: + 0x1445a6 (0x558580c535a6 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:frame #57: + 0x150582 (0x564de4a6b582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:frame #45: + 0x150582 (0x55ee64f8d582 in 
/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:frame #46: PyObject_Call + 0xbc (0x55ee64f8df1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f26b560fd87 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default6]:frame #46: PyObject_Call + 0xbc (0x5603c81a3f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:frame #55: PyObject_Call + 0x207 (0x55e6ee8e3067 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:frame #47: _PyEval_EvalFrameDefault + 0x2d83 (0x5603c818a2b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:frame #52: _PyObject_FastCallDictTstate + 0x187 (0x55b2760c4007 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:frame #22: _PyObject_MakeTpCall + 0x26b (0x558580c4ca6b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:frame #30: + 0x150582 (0x55fbc6edb582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]: new_kwargs[name] = recv_from_pipeline_state_buffer( [default3]:frame #31: _PyEval_EvalFrameDefault + 0x13ca (0x55fbc6ec08fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:frame #56: _PyEval_EvalFrameDefault + 0x2d83 (0x55e6ee8c92b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:frame #63: PyObject_Call + 0xbc (0x555988ab6f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:. This may indicate a possible application crash on rank 0 or a network set up issue. 
[default6]:frame #1: + 0x589518e (0x7f26ed5c918e in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default5]:frame #23: + 0x150866 (0x558580c5f866 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:frame #58: _PyEval_EvalFrameDefault + 0x13ca (0x564de4a508fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 150, in run_communication [default2]:frame #59: + 0x150582 (0x564de4a6b582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:frame #48: + 0x150582 (0x5603c81a3582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:frame #60: PyObject_Call + 0xbc (0x564de4a6bf1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:frame #47: _PyEval_EvalFrameDefault + 0x2d83 (0x55ee64f742b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:frame #53: _PyObject_Call_Prepend + 0x69 (0x55b2760d5c39 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:frame #24: _PyEval_EvalFrameDefault + 0x4c12 (0x558580c48142 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:frame #54: + 0x211239 (0x55b276198239 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/functional.py", line 117, in recv_from_pipeline_state_buffer [default7]: pipeline_state.run_communication() [default7]: recv_activation_tensor = recv_activation() [default2]:frame #61: _PyEval_EvalFrameDefault + 0x2d83 (0x564de4a522b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:frame #57: + 0x150582 (0x55bf5560d582 in 
/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:frame #62: + 0x150582 (0x564de4a6b582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:frame #32: + 0x150582 (0x55fbc6edb582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:frame #2: c10d::TCPStore::doWait(c10::ArrayRef, std::chrono::duration >) + 0x360 (0x7f26ed5c39a0 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default2]:frame #25: _PyFunction_Vectorcall + 0x6c (0x55635793ea2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:frame #57: + 0x150582 (0x55e6ee8e2582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:frame #58: _PyEval_EvalFrameDefault + 0x13ca (0x55e6ee8c78fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:frame #49: PyObject_Call + 0xbc (0x5603c81a3f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:frame #50: _PyEval_EvalFrameDefault + 0x2d83 (0x5603c818a2b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:frame #59: + 0x150582 (0x55e6ee8e2582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 150, in run_communication [default6]:frame #3: c10d::TCPStore::doGet(std::string const&) + 0x32 (0x7f26ed5c3ce2 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default2]:frame #63: PyObject_Call + 0xbc (0x564de4a6bf1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:frame #51: _PyFunction_Vectorcall + 0x6c (0x5603c8197a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:frame #33: _PyEval_EvalFrameDefault + 0x13ca 
(0x55fbc6ec08fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:. This may indicate a possible application crash on rank 0 or a network set up issue. [default3]:frame #34: + 0x150582 (0x55fbc6edb582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]: recv_activation_tensor = recv_activation() [default7]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 31, in __call__ [default0]:frame #48: + 0x150582 (0x55ee64f8d582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:frame #60: PyObject_Call + 0xbc (0x55e6ee8e2f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:frame #55: PyObject_Call + 0x207 (0x55b2760d8067 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:frame #52: _PyObject_FastCallDictTstate + 0x187 (0x5603c8190007 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]: return self.p2p.recv_tensors(num_tensors=1, from_rank=self.from_rank)[0] [default4]:frame #58: _PyEval_EvalFrameDefault + 0x13ca (0x55bf555f28fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:frame #35: _PyEval_EvalFrameDefault + 0x13ca (0x55fbc6ec08fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:frame #61: _PyEval_EvalFrameDefault + 0x2d83 (0x55e6ee8c92b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:frame #56: _PyEval_EvalFrameDefault + 0x2d83 (0x55b2760be2b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:frame #36: _PyObject_FastCallDictTstate + 0xd0 (0x55fbc6ec7f50 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 353, in recv_tensors [default7]: File 
"/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 31, in __call__ [default6]:frame #53: _PyObject_Call_Prepend + 0x69 (0x5603c81a1c39 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:frame #54: + 0x211239 (0x5603c8264239 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:frame #4: c10d::TCPStore::get(std::string const&) + 0xa1 (0x7f26ed5c4b11 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default0]:frame #49: PyObject_Call + 0xbc (0x55ee64f8df1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:frame #57: + 0x150582 (0x55b2760d7582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:frame #58: _PyEval_EvalFrameDefault + 0x13ca (0x55b2760bc8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]: return self.p2p.recv_tensors(num_tensors=1, from_rank=self.from_rank)[0] [default6]:frame #5: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f26ed579f81 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default2]:frame #26: PyObject_Call + 0xbc (0x55635794af1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:frame #62: + 0x150582 (0x55e6ee8e2582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:frame #27: _PyEval_EvalFrameDefault + 0x2d83 (0x5563579312b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]: buffers, futures = self.irecv_tensors(num_tensors=num_tensors, from_rank=from_rank, tag=tag) [default6]:frame #6: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f26ed579f81 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default0]:frame #50: _PyEval_EvalFrameDefault 
+ 0x2d83 (0x55ee64f742b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:frame #63: PyObject_Call + 0xbc (0x55e6ee8e2f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 353, in recv_tensors [default7]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 326, in irecv_tensors [default6]:frame #7: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f26ed579f81 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default3]:frame #37: _PyObject_Call_Prepend + 0x69 (0x55fbc6ed9c39 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:. This may indicate a possible application crash on rank 0 or a network set up issue. [default3]:frame #38: + 0x211239 (0x55fbc6f9c239 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:frame #59: + 0x150582 (0x55b2760d7582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:frame #51: _PyFunction_Vectorcall + 0x6c (0x55ee64f81a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]: meta = self._recv_meta(from_rank=from_rank, tag=tag) [default6]:frame #8: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f26ed579f81 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default6]:frame #9: c10d::ProcessGroupNCCL::broadcastUniqueNCCLID(ncclUniqueId*, bool, std::string const&, int) + 0xa9 (0x7f26b67b7c69 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default7]: buffers, futures = self.irecv_tensors(num_tensors=num_tensors, from_rank=from_rank, tag=tag) [default5]:frame #25: _PyFunction_Vectorcall + 
0x6c (0x558580c53a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:frame #60: PyObject_Call + 0xbc (0x55b2760d7f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:frame #59: + 0x150582 (0x55bf5560d582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:frame #55: PyObject_Call + 0x207 (0x5603c81a4067 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 246, in _recv_meta [default7]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 326, in irecv_tensors [default6]:frame #10: c10d::ProcessGroupNCCL::getNCCLComm(std::string const&, std::vector > const&, c10d::OpType, int, bool) + 0x22b (0x7f26b67bec5b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default3]:frame #39: _PyObject_MakeTpCall + 0x26b (0x55fbc6ec8a6b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]: dist.recv( [default7]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 72, in wrapper [default3]:frame #40: _PyEval_EvalFrameDefault + 0x4eb6 (0x55fbc6ec43e6 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:frame #11: c10d::ProcessGroupNCCL::recv(std::vector >&, int, int) + 0x550 (0x7f26b67e1b60 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default5]:frame #61: _PyEval_EvalFrameDefault + 0x2d83 (0x55b2760be2b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:frame #28: _PyFunction_Vectorcall + 0x6c (0x55635793ea2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]: return 
func(*args, **kwargs) [default7]: meta = self._recv_meta(from_rank=from_rank, tag=tag) [default5]:frame #62: + 0x150582 (0x55b2760d7582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:frame #52: _PyObject_FastCallDictTstate + 0x187 (0x55ee64f7a007 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1706, in recv [default7]: pg.recv([tensor], group_src_rank, tag).wait() [default0]:frame #53: _PyObject_Call_Prepend + 0x69 (0x55ee64f8bc39 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:frame #63: PyObject_Call + 0xbc (0x55b2760d7f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:frame #29: _PyEval_EvalFrameDefault + 0x13ca (0x55635792f8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:frame #12: + 0x5838439 (0x7f26ed56c439 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default4]:frame #60: PyObject_Call + 0xbc (0x55bf5560df1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:frame #56: _PyEval_EvalFrameDefault + 0x2d83 (0x5603c818a2b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:frame #41: _PyFunction_Vectorcall + 0x6c (0x55fbc6ecfa2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:frame #26: PyObject_Call + 0xbc (0x558580c5ff1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:. This may indicate a possible application crash on rank 0 or a network set up issue. 
[default7]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 246, in _recv_meta [default6]:frame #13: + 0x5843330 (0x7f26ed577330 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default6]:frame #14: + 0x58433c5 (0x7f26ed5773c5 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default6]:frame #15: + 0x4e893cc (0x7f26ecbbd3cc in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default4]:frame #61: _PyEval_EvalFrameDefault + 0x2d83 (0x55bf555f42b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:frame #16: + 0x1a08a88 (0x7f26e973ca88 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default6]:frame #17: + 0x5849a84 (0x7f26ed57da84 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default0]:frame #54: + 0x211239 (0x55ee6504e239 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:torch.distributed.DistBackendError: [11] is setting up NCCL communicator and retrieving ncclUniqueId from [0] via c10d key-value store by key '10:11', but store->get('10:11') got error: Connection reset by peer [default7]:Exception raised from recvBytes at ../torch/csrc/distributed/c10d/Utils.hpp:670 (most recent call first): [default0]:frame #55: PyObject_Call + 0x207 (0x55ee64f8e067 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7fceca6b4d87 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default7]: dist.recv( [default7]:frame #1: + 0x589518e (0x7fcf0266e18e in 
/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default4]:frame #62: + 0x150582 (0x55bf5560d582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:frame #2: c10d::TCPStore::doWait(c10::ArrayRef, std::chrono::duration >) + 0x360 (0x7fcf026689a0 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default7]:frame #3: c10d::TCPStore::doGet(std::string const&) + 0x32 (0x7fcf02668ce2 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default2]:frame #30: + 0x150582 (0x55635794a582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:frame #4: c10d::TCPStore::get(std::string const&) + 0xa1 (0x7fcf02669b11 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default3]:frame #42: _PyEval_EvalFrameDefault + 0x72c (0x55fbc6ebfc5c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:frame #5: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7fcf0261ef81 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default6]:frame #57: + 0x150582 (0x5603c81a3582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:frame #6: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7fcf0261ef81 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default6]:frame #18: + 0x584ed35 (0x7f26ed582d35 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default2]:frame #31: _PyEval_EvalFrameDefault + 0x13ca (0x55635792f8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:frame #7: c10d::PrefixStore::get(std::string const&) + 0x31 
(0x7fcf0261ef81 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default5]:frame #27: _PyEval_EvalFrameDefault + 0x2d83 (0x558580c462b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:frame #19: + 0xc97eee (0x7f26ffe34eee in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_python.so) [default4]:frame #63: PyObject_Call + 0xbc (0x55bf5560df1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:frame #8: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7fcf0261ef81 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default4]:. This may indicate a possible application crash on rank 0 or a network set up issue. [default6]:frame #20: + 0x413ea4 (0x7f26ff5b0ea4 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_python.so) [default2]:frame #32: + 0x150582 (0x55635794a582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:frame #9: c10d::ProcessGroupNCCL::broadcastUniqueNCCLID(ncclUniqueId*, bool, std::string const&, int) + 0xa9 (0x7fcecb85cc69 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default5]:frame #28: _PyFunction_Vectorcall + 0x6c (0x558580c53a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:frame #21: + 0x1445a6 (0x55ee681415a6 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:frame #58: _PyEval_EvalFrameDefault + 0x13ca (0x5603c81888fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:frame #10: c10d::ProcessGroupNCCL::getNCCLComm(std::string const&, std::vector > const&, c10d::OpType, int, bool) + 0x22b (0x7fcecb863c5b in 
/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default3]:frame #43: _PyFunction_Vectorcall + 0x6c (0x55fbc6ecfa2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:frame #44: _PyEval_EvalFrameDefault + 0x13ca (0x55fbc6ec08fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:frame #22: _PyObject_MakeTpCall + 0x26b (0x55ee6813aa6b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 72, in wrapper [default2]:frame #33: _PyEval_EvalFrameDefault + 0x13ca (0x55635792f8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:frame #11: c10d::ProcessGroupNCCL::recv(std::vector >&, int, int) + 0x550 (0x7fcecb886b60 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default0]:frame #56: _PyEval_EvalFrameDefault + 0x2d83 (0x55ee64f742b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:frame #23: + 0x150866 (0x55ee6814d866 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:frame #57: + 0x150582 (0x55ee64f8d582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:frame #24: _PyEval_EvalFrameDefault + 0x4c12 (0x55ee68136142 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]: return func(*args, **kwargs) [default6]:frame #25: _PyFunction_Vectorcall + 0x6c (0x55ee68141a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:frame #45: + 0x150582 (0x55fbc6edb582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:frame #26: PyObject_Call + 0xbc (0x55ee6814df1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) 
[default6]:frame #27: _PyEval_EvalFrameDefault + 0x2d83 (0x55ee681342b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:frame #34: + 0x150582 (0x55635794a582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:frame #12: + 0x5838439 (0x7fcf02611439 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default5]:frame #29: _PyEval_EvalFrameDefault + 0x13ca (0x558580c448fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:frame #28: _PyFunction_Vectorcall + 0x6c (0x55ee68141a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:frame #29: _PyEval_EvalFrameDefault + 0x13ca (0x55ee681328fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:frame #59: + 0x150582 (0x5603c81a3582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:frame #30: + 0x150582 (0x55ee6814d582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:frame #31: _PyEval_EvalFrameDefault + 0x13ca (0x55ee681328fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:frame #46: PyObject_Call + 0xbc (0x55fbc6edbf1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:frame #32: + 0x150582 (0x55ee6814d582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1706, in recv [default6]:frame #33: _PyEval_EvalFrameDefault + 0x13ca (0x55ee681328fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:frame #13: + 0x5843330 (0x7fcf0261c330 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default0]:frame #58: _PyEval_EvalFrameDefault + 0x13ca 
(0x55ee64f728fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:frame #34: + 0x150582 (0x55ee6814d582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:frame #35: _PyEval_EvalFrameDefault + 0x13ca (0x55635792f8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:frame #14: + 0x58433c5 (0x7fcf0261c3c5 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default5]:frame #30: + 0x150582 (0x558580c5f582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:frame #35: _PyEval_EvalFrameDefault + 0x13ca (0x55ee681328fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:frame #60: PyObject_Call + 0xbc (0x5603c81a3f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:frame #15: + 0x4e893cc (0x7fcf01c623cc in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default0]:frame #59: + 0x150582 (0x55ee64f8d582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:frame #36: _PyObject_FastCallDictTstate + 0xd0 (0x556357936f50 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:frame #36: _PyObject_FastCallDictTstate + 0xd0 (0x55ee68139f50 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:frame #47: _PyEval_EvalFrameDefault + 0x2d83 (0x55fbc6ec22b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:frame #16: + 0x1a08a88 (0x7fcefe7e1a88 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default7]:frame #17: + 0x5849a84 (0x7fcf02622a84 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default3]:frame #48: + 0x150582 (0x55fbc6edb582 in 
/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:frame #37: _PyObject_Call_Prepend + 0x69 (0x55ee6814bc39 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:frame #37: _PyObject_Call_Prepend + 0x69 (0x556357948c39 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:frame #38: + 0x211239 (0x55ee6820e239 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:frame #18: + 0x584ed35 (0x7fcf02627d35 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default6]:frame #61: _PyEval_EvalFrameDefault + 0x2d83 (0x5603c818a2b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:frame #39: _PyObject_MakeTpCall + 0x26b (0x55ee6813aa6b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:frame #49: PyObject_Call + 0xbc (0x55fbc6edbf1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:frame #19: + 0xc97eee (0x7fcf14ed9eee in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_python.so) [default5]:frame #31: _PyEval_EvalFrameDefault + 0x13ca (0x558580c448fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:frame #20: + 0x413ea4 (0x7fcf14655ea4 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_python.so) [default7]: pg.recv([tensor], group_src_rank, tag).wait() [default7]:frame #21: + 0x1445a6 (0x5580eecb95a6 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:frame #60: PyObject_Call + 0xbc (0x55ee64f8df1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:frame #22: _PyObject_MakeTpCall + 0x26b (0x5580eecb2a6b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:frame #23: + 0x150866 
(0x5580eecc5866 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:frame #61: _PyEval_EvalFrameDefault + 0x2d83 (0x55ee64f742b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:frame #40: _PyEval_EvalFrameDefault + 0x4eb6 (0x55ee681363e6 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:frame #62: + 0x150582 (0x5603c81a3582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:frame #24: _PyEval_EvalFrameDefault + 0x4c12 (0x5580eecae142 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:frame #63: PyObject_Call + 0xbc (0x5603c81a3f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:frame #41: _PyFunction_Vectorcall + 0x6c (0x55ee68141a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:. This may indicate a possible application crash on rank 0 or a network set up issue. 
[default5]:frame #32: + 0x150582 (0x558580c5f582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:frame #62: + 0x150582 (0x55ee64f8d582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:frame #25: _PyFunction_Vectorcall + 0x6c (0x5580eecb9a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:frame #38: + 0x211239 (0x556357a0b239 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:frame #42: _PyEval_EvalFrameDefault + 0x72c (0x55ee68131c5c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:torch.distributed.DistBackendError: [7] is setting up NCCL communicator and retrieving ncclUniqueId from [0] via c10d key-value store by key '6:7', but store->get('6:7') got error: Connection reset by peer [default7]:frame #26: PyObject_Call + 0xbc (0x5580eecc5f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:Exception raised from recvBytes at ../torch/csrc/distributed/c10d/Utils.hpp:670 (most recent call first): [default0]:frame #63: PyObject_Call + 0xbc (0x55ee64f8df1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:frame #43: _PyFunction_Vectorcall + 0x6c (0x55ee68141a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:. This may indicate a possible application crash on rank 0 or a network set up issue. 
[default3]:frame #50: _PyEval_EvalFrameDefault + 0x2d83 (0x55fbc6ec22b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:frame #27: _PyEval_EvalFrameDefault + 0x2d83 (0x5580eecac2b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:frame #51: _PyFunction_Vectorcall + 0x6c (0x55fbc6ecfa2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:frame #44: _PyEval_EvalFrameDefault + 0x13ca (0x55ee681328fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:frame #45: + 0x150582 (0x55ee6814d582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:frame #39: _PyObject_MakeTpCall + 0x26b (0x556357937a6b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:frame #46: PyObject_Call + 0xbc (0x55ee6814df1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:frame #47: _PyEval_EvalFrameDefault + 0x2d83 (0x55ee681342b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:frame #40: _PyEval_EvalFrameDefault + 0x4eb6 (0x5563579333e6 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:frame #28: _PyFunction_Vectorcall + 0x6c (0x5580eecb9a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:frame #41: _PyFunction_Vectorcall + 0x6c (0x55635793ea2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:frame #29: _PyEval_EvalFrameDefault + 0x13ca (0x5580eecaa8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:frame #48: + 0x150582 (0x55ee6814d582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:frame #49: PyObject_Call + 0xbc (0x55ee6814df1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:frame #50: _PyEval_EvalFrameDefault + 0x2d83 (0x55ee681342b3 in 
/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:frame #42: _PyEval_EvalFrameDefault + 0x72c (0x55635792ec5c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:frame #51: _PyFunction_Vectorcall + 0x6c (0x55ee68141a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:frame #52: _PyObject_FastCallDictTstate + 0x187 (0x55ee6813a007 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f8a2895ad87 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default6]:frame #53: _PyObject_Call_Prepend + 0x69 (0x55ee6814bc39 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:frame #33: _PyEval_EvalFrameDefault + 0x13ca (0x558580c448fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:frame #30: + 0x150582 (0x5580eecc5582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:frame #31: _PyEval_EvalFrameDefault + 0x13ca (0x5580eecaa8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:frame #43: _PyFunction_Vectorcall + 0x6c (0x55635793ea2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:frame #32: + 0x150582 (0x5580eecc5582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:frame #44: _PyEval_EvalFrameDefault + 0x13ca (0x55635792f8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:frame #54: + 0x211239 (0x55ee6820e239 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:frame #34: + 0x150582 (0x558580c5f582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:frame #33: _PyEval_EvalFrameDefault + 0x13ca (0x5580eecaa8fa in 
/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:frame #35: _PyEval_EvalFrameDefault + 0x13ca (0x558580c448fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:frame #34: + 0x150582 (0x5580eecc5582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:frame #55: PyObject_Call + 0x207 (0x55ee6814e067 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:frame #45: + 0x150582 (0x55635794a582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:frame #56: _PyEval_EvalFrameDefault + 0x2d83 (0x55ee681342b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:frame #57: + 0x150582 (0x55ee6814d582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:frame #52: _PyObject_FastCallDictTstate + 0x187 (0x55fbc6ec8007 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:frame #35: _PyEval_EvalFrameDefault + 0x13ca (0x5580eecaa8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:frame #36: _PyObject_FastCallDictTstate + 0xd0 (0x558580c4bf50 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:frame #36: _PyObject_FastCallDictTstate + 0xd0 (0x5580eecb1f50 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:frame #1: + 0x589518e (0x7f8a6091418e in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default3]:frame #53: _PyObject_Call_Prepend + 0x69 (0x55fbc6ed9c39 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:frame #37: _PyObject_Call_Prepend + 0x69 (0x5580eecc3c39 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:frame #46: PyObject_Call + 0xbc (0x55635794af1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) 
[default6]:frame #58: _PyEval_EvalFrameDefault + 0x13ca (0x55ee681328fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:frame #2: c10d::TCPStore::doWait(c10::ArrayRef, std::chrono::duration >) + 0x360 (0x7f8a6090e9a0 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default5]:frame #37: _PyObject_Call_Prepend + 0x69 (0x558580c5dc39 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:frame #59: + 0x150582 (0x55ee6814d582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:frame #60: PyObject_Call + 0xbc (0x55ee6814df1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:frame #54: + 0x211239 (0x55fbc6f9c239 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:frame #38: + 0x211239 (0x5580eed86239 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:frame #3: c10d::TCPStore::doGet(std::string const&) + 0x32 (0x7f8a6090ece2 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default6]:frame #61: _PyEval_EvalFrameDefault + 0x2d83 (0x55ee681342b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:frame #55: PyObject_Call + 0x207 (0x55fbc6edc067 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:frame #39: _PyObject_MakeTpCall + 0x26b (0x5580eecb2a6b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:frame #47: _PyEval_EvalFrameDefault + 0x2d83 (0x5563579312b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:frame #62: + 0x150582 (0x55ee6814d582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:frame #38: + 0x211239 (0x558580d20239 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:frame 
#40: _PyEval_EvalFrameDefault + 0x4eb6 (0x5580eecae3e6 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:frame #4: c10d::TCPStore::get(std::string const&) + 0xa1 (0x7f8a6090fb11 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default6]:frame #63: PyObject_Call + 0xbc (0x55ee6814df1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:frame #56: _PyEval_EvalFrameDefault + 0x2d83 (0x55fbc6ec22b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:frame #41: _PyFunction_Vectorcall + 0x6c (0x5580eecb9a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:frame #48: + 0x150582 (0x55635794a582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:. This may indicate a possible application crash on rank 0 or a network set up issue. [default7]:frame #5: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f8a608c4f81 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default7]:frame #42: _PyEval_EvalFrameDefault + 0x72c (0x5580eeca9c5c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:frame #43: _PyFunction_Vectorcall + 0x6c (0x5580eecb9a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:frame #44: _PyEval_EvalFrameDefault + 0x13ca (0x5580eecaa8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:frame #39: _PyObject_MakeTpCall + 0x26b (0x558580c4ca6b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:frame #45: + 0x150582 (0x5580eecc5582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:frame #46: PyObject_Call + 0xbc (0x5580eecc5f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:frame #47: 
_PyEval_EvalFrameDefault + 0x2d83 (0x5580eecac2b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:frame #57: + 0x150582 (0x55fbc6edb582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:frame #48: + 0x150582 (0x5580eecc5582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:frame #49: PyObject_Call + 0xbc (0x5580eecc5f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:frame #50: _PyEval_EvalFrameDefault + 0x2d83 (0x5580eecac2b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:frame #51: _PyFunction_Vectorcall + 0x6c (0x5580eecb9a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:frame #6: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f8a608c4f81 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default7]:frame #52: _PyObject_FastCallDictTstate + 0x187 (0x5580eecb2007 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:frame #53: _PyObject_Call_Prepend + 0x69 (0x5580eecc3c39 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:frame #54: + 0x211239 (0x5580eed86239 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:frame #55: PyObject_Call + 0x207 (0x5580eecc6067 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:frame #56: _PyEval_EvalFrameDefault + 0x2d83 (0x5580eecac2b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:frame #49: PyObject_Call + 0xbc (0x55635794af1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:frame #57: + 0x150582 (0x5580eecc5582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:frame #58: _PyEval_EvalFrameDefault + 0x13ca (0x5580eecaa8fa in 
/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:frame #59: + 0x150582 (0x5580eecc5582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:frame #60: PyObject_Call + 0xbc (0x5580eecc5f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:frame #40: _PyEval_EvalFrameDefault + 0x4eb6 (0x558580c483e6 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:frame #61: _PyEval_EvalFrameDefault + 0x2d83 (0x5580eecac2b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:frame #62: + 0x150582 (0x5580eecc5582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:frame #63: PyObject_Call + 0xbc (0x5580eecc5f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:. This may indicate a possible application crash on rank 0 or a network set up issue. [default3]:frame #58: _PyEval_EvalFrameDefault + 0x13ca (0x55fbc6ec08fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:frame #7: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f8a608c4f81 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default6]:Exception raised from recvBytes at ../torch/csrc/distributed/c10d/Utils.hpp:670 (most recent call first): [default2]:frame #20: + 0x413ea4 (0x7f7c8b196ea4 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_python.so) [default2]:frame #21: + 0x1445a6 (0x56244cb325a6 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:frame #28: _PyFunction_Vectorcall + 0x6c (0x55f2231e4a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:Exception raised from recvBytes at ../torch/csrc/distributed/c10d/Utils.hpp:670 (most recent call first): [default3]:frame #0: 
c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7fef193d5d87 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default2]:frame #50: _PyEval_EvalFrameDefault + 0x2d83 (0x5563579312b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:frame #29: _PyEval_EvalFrameDefault + 0x13ca (0x55f2231d58fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:frame #41: _PyFunction_Vectorcall + 0x6c (0x558580c53a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f0ac7e33d87 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default6]:frame #1: + 0x589518e (0x7f0affded18e in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default3]:frame #59: + 0x150582 (0x55fbc6edb582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:frame #30: + 0x150582 (0x55f2231f0582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:frame #8: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f8a608c4f81 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default7]:frame #31: _PyEval_EvalFrameDefault + 0x13ca (0x55f2231d58fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:frame #42: _PyEval_EvalFrameDefault + 0x72c (0x558580c43c5c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:frame #22: _PyObject_MakeTpCall + 0x26b (0x56244cb2ba6b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:frame #60: PyObject_Call + 0xbc (0x55fbc6edbf1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:frame #1: + 0x589518e 
(0x7fef5138f18e in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default2]:frame #51: _PyFunction_Vectorcall + 0x6c (0x55635793ea2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:frame #2: c10d::TCPStore::doWait(c10::ArrayRef, std::chrono::duration >) + 0x360 (0x7fef513899a0 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default6]:frame #2: c10d::TCPStore::doWait(c10::ArrayRef, std::chrono::duration >) + 0x360 (0x7f0affde79a0 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default7]:frame #9: c10d::ProcessGroupNCCL::broadcastUniqueNCCLID(ncclUniqueId*, bool, std::string const&, int) + 0xa9 (0x7f8a29b02c69 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default4]: trainer.train(dataloader) [default5]:frame #43: _PyFunction_Vectorcall + 0x6c (0x558580c53a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:frame #32: + 0x150582 (0x55f2231f0582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:frame #3: c10d::TCPStore::doGet(std::string const&) + 0x32 (0x7fef51389ce2 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default2]:frame #52: _PyObject_FastCallDictTstate + 0x187 (0x556357937007 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:frame #23: + 0x150866 (0x56244cb3e866 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:frame #10: c10d::ProcessGroupNCCL::getNCCLComm(std::string const&, std::vector > const&, c10d::OpType, int, bool) + 0x22b (0x7f8a29b09c5b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default7]:frame #33: 
_PyEval_EvalFrameDefault + 0x13ca (0x55f2231d58fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:frame #11: c10d::ProcessGroupNCCL::recv(std::vector >&, int, int) + 0x550 (0x7f8a29b2cb60 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default2]:frame #24: _PyEval_EvalFrameDefault + 0x4c12 (0x56244cb27142 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:frame #53: _PyObject_Call_Prepend + 0x69 (0x556357948c39 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:Traceback (most recent call last): [default5]:frame #44: _PyEval_EvalFrameDefault + 0x13ca (0x558580c448fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 237, in [default7]:frame #34: + 0x150582 (0x55f2231f0582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:frame #61: _PyEval_EvalFrameDefault + 0x2d83 (0x55fbc6ec22b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:frame #35: _PyEval_EvalFrameDefault + 0x13ca (0x55f2231d58fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:frame #25: _PyFunction_Vectorcall + 0x6c (0x56244cb32a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:frame #62: + 0x150582 (0x55fbc6edb582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]: trainer.train(dataloader) [default3]:frame #63: PyObject_Call + 0xbc (0x55fbc6edbf1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:frame #12: + 0x5838439 (0x7f8a608b7439 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default7]:frame #36: _PyObject_FastCallDictTstate + 0xd0 (0x55f2231dcf50 in 
/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:frame #26: PyObject_Call + 0xbc (0x56244cb3ef1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:frame #27: _PyEval_EvalFrameDefault + 0x2d83 (0x56244cb252b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:frame #54: + 0x211239 (0x556357a0b239 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:frame #28: _PyFunction_Vectorcall + 0x6c (0x56244cb32a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 430, in train [default6]:frame #3: c10d::TCPStore::doGet(std::string const&) + 0x32 (0x7f0affde7ce2 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default3]:. This may indicate a possible application crash on rank 0 or a network set up issue. 
[default6]:frame #4: c10d::TCPStore::get(std::string const&) + 0xa1 (0x7f0affde8b11 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default2]:frame #55: PyObject_Call + 0x207 (0x55635794b067 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:frame #29: _PyEval_EvalFrameDefault + 0x13ca (0x56244cb238fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:frame #4: c10d::TCPStore::get(std::string const&) + 0xa1 (0x7fef5138ab11 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default5]:frame #45: + 0x150582 (0x558580c5f582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:frame #46: PyObject_Call + 0xbc (0x558580c5ff1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:frame #37: _PyObject_Call_Prepend + 0x69 (0x55f2231eec39 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:frame #47: _PyEval_EvalFrameDefault + 0x2d83 (0x558580c462b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:frame #56: _PyEval_EvalFrameDefault + 0x2d83 (0x5563579312b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]: outputs, loss_avg = self.training_step(dataloader=self.current_dataloader) [default2]:frame #57: + 0x150582 (0x55635794a582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:frame #5: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7fef5133ff81 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default6]:frame #5: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f0affd9df81 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default7]:frame #13: + 0x5843330 (0x7f8a608c2330 in 
/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default6]:frame #6: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f0affd9df81 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default2]:frame #58: _PyEval_EvalFrameDefault + 0x13ca (0x55635792f8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 430, in train [default7]:frame #14: + 0x58433c5 (0x7f8a608c23c5 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default5]: outputs, loss_avg = self.training_step(dataloader=self.current_dataloader) [default7]:frame #38: + 0x211239 (0x55f2232b1239 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:frame #59: + 0x150582 (0x55635794a582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:frame #7: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f0affd9df81 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default6]:frame #8: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f0affd9df81 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default7]:frame #15: + 0x4e893cc (0x7f8a5ff083cc in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default5]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 459, in training_step [default2]:frame #60: PyObject_Call + 0xbc (0x55635794af1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]: outputs = self.pipeline_engine.train_batch_iter( [default7]:frame #16: + 0x1a08a88 (0x7f8a5ca87a88 in 
/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default7]:frame #39: _PyObject_MakeTpCall + 0x26b (0x55f2231dda6b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:frame #40: _PyEval_EvalFrameDefault + 0x4eb6 (0x55f2231d93e6 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:frame #61: _PyEval_EvalFrameDefault + 0x2d83 (0x5563579312b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:frame #41: _PyFunction_Vectorcall + 0x6c (0x55f2231e4a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:frame #62: + 0x150582 (0x55635794a582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 459, in training_step [default5]:frame #48: + 0x150582 (0x558580c5f582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:frame #42: _PyEval_EvalFrameDefault + 0x72c (0x55f2231d4c5c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:frame #6: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7fef5133ff81 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default2]:frame #63: PyObject_Call + 0xbc (0x55635794af1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:frame #9: c10d::ProcessGroupNCCL::broadcastUniqueNCCLID(ncclUniqueId*, bool, std::string const&, int) + 0xa9 (0x7f0ac8fdbc69 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default7]:frame #17: + 0x5849a84 (0x7f8a608c8a84 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default4]: outputs = self.pipeline_engine.train_batch_iter( [default5]:frame #49: 
PyObject_Call + 0xbc (0x558580c5ff1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:frame #43: _PyFunction_Vectorcall + 0x6c (0x55f2231e4a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:frame #7: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7fef5133ff81 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default5]:frame #50: _PyEval_EvalFrameDefault + 0x2d83 (0x558580c462b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:frame #51: _PyFunction_Vectorcall + 0x6c (0x558580c53a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:frame #30: + 0x150582 (0x56244cb3e582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:. This may indicate a possible application crash on rank 0 or a network set up issue. [default2]:frame #31: _PyEval_EvalFrameDefault + 0x13ca (0x56244cb238fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 187, in train_batch_iter [default5]:frame #52: _PyObject_FastCallDictTstate + 0x187 (0x558580c4c007 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:frame #53: _PyObject_Call_Prepend + 0x69 (0x558580c5dc39 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]: output = self.forward(context=context, state=state, micro_batch=micro_batch, model=model) [default5]:frame #54: + 0x211239 (0x558580d20239 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:frame #32: + 0x150582 (0x56244cb3e582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:frame #55: PyObject_Call + 0x207 (0x558580c60067 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) 
[default5]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 44, in forward [default7]:frame #18: + 0x584ed35 (0x7f8a608cdd35 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default3]:frame #8: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7fef5133ff81 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default5]:frame #56: _PyEval_EvalFrameDefault + 0x2d83 (0x558580c462b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:frame #57: + 0x150582 (0x558580c5f582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:frame #9: c10d::ProcessGroupNCCL::broadcastUniqueNCCLID(ncclUniqueId*, bool, std::string const&, int) + 0xa9 (0x7fef1a57dc69 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default2]:frame #33: _PyEval_EvalFrameDefault + 0x13ca (0x56244cb238fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]: output = model(**micro_batch) [default5]:frame #58: _PyEval_EvalFrameDefault + 0x13ca (0x558580c448fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 187, in train_batch_iter [default5]:frame #59: + 0x150582 (0x558580c5f582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:frame #60: PyObject_Call + 0xbc (0x558580c5ff1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:frame #44: _PyEval_EvalFrameDefault + 0x13ca (0x55f2231d58fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]: output = self.forward(context=context, state=state, micro_batch=micro_batch, model=model) 
[default5]:frame #61: _PyEval_EvalFrameDefault + 0x2d83 (0x558580c462b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl [default7]:frame #19: + 0xc97eee (0x7f8a7317feee in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_python.so) [default7]:frame #20: + 0x413ea4 (0x7f8a728fbea4 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_python.so) [default7]:frame #45: + 0x150582 (0x55f2231f0582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:frame #62: + 0x150582 (0x558580c5f582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:frame #10: c10d::ProcessGroupNCCL::getNCCLComm(std::string const&, std::vector > const&, c10d::OpType, int, bool) + 0x22b (0x7fef1a584c5b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default6]:frame #10: c10d::ProcessGroupNCCL::getNCCLComm(std::string const&, std::vector > const&, c10d::OpType, int, bool) + 0x22b (0x7f0ac8fe2c5b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default5]: return self._call_impl(*args, **kwargs) [default7]:frame #21: + 0x1445a6 (0x556815caf5a6 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl [default4]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 44, in forward [default5]:frame #63: PyObject_Call + 0xbc (0x558580c5ff1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) 
[default6]:frame #11: c10d::ProcessGroupNCCL::recv(std::vector >&, int, int) + 0x550 (0x7f0ac9005b60 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default7]:frame #22: _PyObject_MakeTpCall + 0x26b (0x556815ca8a6b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:frame #12: + 0x5838439 (0x7f0affd90439 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default5]:. This may indicate a possible application crash on rank 0 or a network set up issue. [default3]:frame #11: c10d::ProcessGroupNCCL::recv(std::vector >&, int, int) + 0x550 (0x7fef1a5a7b60 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default4]: output = model(**micro_batch) [default7]:frame #23: + 0x150866 (0x556815cbb866 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:frame #24: _PyEval_EvalFrameDefault + 0x4c12 (0x556815ca4142 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:frame #25: _PyFunction_Vectorcall + 0x6c (0x556815cafa2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl [default7]:frame #26: PyObject_Call + 0xbc (0x556815cbbf1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:frame #27: _PyEval_EvalFrameDefault + 0x2d83 (0x556815ca22b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:frame #28: _PyFunction_Vectorcall + 0x6c (0x556815cafa2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:frame #34: + 0x150582 (0x56244cb3e582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:frame #35: 
_PyEval_EvalFrameDefault + 0x13ca (0x56244cb238fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:frame #29: _PyEval_EvalFrameDefault + 0x13ca (0x556815ca08fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:frame #30: + 0x150582 (0x556815cbb582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:frame #31: _PyEval_EvalFrameDefault + 0x13ca (0x556815ca08fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:frame #32: + 0x150582 (0x556815cbb582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:frame #13: + 0x5843330 (0x7f0affd9b330 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default7]:frame #33: _PyEval_EvalFrameDefault + 0x13ca (0x556815ca08fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:frame #34: + 0x150582 (0x556815cbb582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:frame #35: _PyEval_EvalFrameDefault + 0x13ca (0x556815ca08fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:frame #36: _PyObject_FastCallDictTstate + 0xd0 (0x556815ca7f50 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:frame #46: PyObject_Call + 0xbc (0x55f2231f0f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:frame #37: _PyObject_Call_Prepend + 0x69 (0x556815cb9c39 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:frame #38: + 0x211239 (0x556815d7c239 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:frame #39: _PyObject_MakeTpCall + 0x26b (0x556815ca8a6b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:frame #40: _PyEval_EvalFrameDefault + 0x4eb6 (0x556815ca43e6 in 
/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:frame #47: _PyEval_EvalFrameDefault + 0x2d83 (0x55f2231d72b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:frame #41: _PyFunction_Vectorcall + 0x6c (0x556815cafa2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:frame #42: _PyEval_EvalFrameDefault + 0x72c (0x556815c9fc5c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:frame #43: _PyFunction_Vectorcall + 0x6c (0x556815cafa2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:frame #44: _PyEval_EvalFrameDefault + 0x13ca (0x556815ca08fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:frame #48: + 0x150582 (0x55f2231f0582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:frame #45: + 0x150582 (0x556815cbb582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:frame #46: PyObject_Call + 0xbc (0x556815cbbf1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:frame #47: _PyEval_EvalFrameDefault + 0x2d83 (0x556815ca22b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:frame #48: + 0x150582 (0x556815cbb582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:frame #14: + 0x58433c5 (0x7f0affd9b3c5 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default7]:frame #49: PyObject_Call + 0xbc (0x556815cbbf1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:frame #50: _PyEval_EvalFrameDefault + 0x2d83 (0x556815ca22b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:frame #51: _PyFunction_Vectorcall + 0x6c (0x556815cafa2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:frame 
#52: _PyObject_FastCallDictTstate + 0x187 (0x556815ca8007 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:frame #36: _PyObject_FastCallDictTstate + 0xd0 (0x56244cb2af50 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:frame #53: _PyObject_Call_Prepend + 0x69 (0x556815cb9c39 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:frame #54: + 0x211239 (0x556815d7c239 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:frame #55: PyObject_Call + 0x207 (0x556815cbc067 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]: return forward_call(*args, **kwargs) [default4]: return self._call_impl(*args, **kwargs) [default4]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl [default7]:frame #56: _PyEval_EvalFrameDefault + 0x2d83 (0x556815ca22b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:frame #57: + 0x150582 (0x556815cbb582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:frame #58: _PyEval_EvalFrameDefault + 0x13ca (0x556815ca08fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:frame #59: + 0x150582 (0x556815cbb582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]: return forward_call(*args, **kwargs) [default2]:frame #37: _PyObject_Call_Prepend + 0x69 (0x56244cb3cc39 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:frame #60: PyObject_Call + 0xbc (0x556815cbbf1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:frame #61: _PyEval_EvalFrameDefault + 0x2d83 (0x556815ca22b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:frame #62: + 0x150582 (0x556815cbb582 in 
/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:frame #63: PyObject_Call + 0xbc (0x556815cbbf1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:frame #38: + 0x211239 (0x56244cbff239 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:. This may indicate a possible application crash on rank 0 or a network set up issue. [default5]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 890, in forward [default6]:frame #15: + 0x4e893cc (0x7f0aff3e13cc in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default6]:frame #16: + 0x1a08a88 (0x7f0afbf60a88 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default2]:frame #30: + 0x150582 (0x55c605a0a582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 890, in forward [default2]:frame #31: _PyEval_EvalFrameDefault + 0x13ca (0x55c6059ef8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:frame #39: _PyObject_MakeTpCall + 0x26b (0x56244cb2ba6b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:frame #24: _PyEval_EvalFrameDefault + 0x4c12 (0x55be5ca78142 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:frame #20: + 0x413ea4 (0x7fc004a9eea4 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_python.so) [default6]:frame #17: + 0x5849a84 (0x7f0affda1a84 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default7]:frame #29: _PyEval_EvalFrameDefault + 0x13ca (0x5558052f18fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) 
[default7]:frame #49: PyObject_Call + 0xbc (0x55f2231f0f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:frame #30: + 0x150582 (0x55580530c582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]: sharded_logits = self.model( [default5]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl [default4]:frame #25: _PyFunction_Vectorcall + 0x6c (0x55be5ca83a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:frame #55: PyObject_Call + 0x207 (0x56108717c067 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:frame #21: + 0x1445a6 (0x55add12895a6 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:frame #22: _PyObject_MakeTpCall + 0x26b (0x55add1282a6b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:frame #18: + 0x584ed35 (0x7f0affda6d35 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default7]:frame #31: _PyEval_EvalFrameDefault + 0x13ca (0x5558052f18fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:frame #32: + 0x150582 (0x55580530c582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:frame #19: + 0xc97eee (0x7f0b12658eee in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_python.so) [default4]: sharded_logits = self.model( [default7]:frame #50: _PyEval_EvalFrameDefault + 0x2d83 (0x55f2231d72b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]: output = self.forward(context=context, state=state, micro_batch=micro_batch, model=model) [default2]:frame #40: _PyEval_EvalFrameDefault + 0x4eb6 (0x56244cb273e6 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) 
[default0]:frame #23: + 0x150866 (0x55add1295866 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:frame #24: _PyEval_EvalFrameDefault + 0x4c12 (0x55add127e142 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:frame #41: _PyFunction_Vectorcall + 0x6c (0x56244cb32a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]: return self._call_impl(*args, **kwargs) [default3]:frame #12: + 0x5838439 (0x7fef51332439 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default5]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 780, in forward_with_hidden_states [default3]:frame #13: + 0x5843330 (0x7fef5133d330 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default5]: hidden_encoder_states = encoder_block(**hidden_encoder_states) [default3]:frame #14: + 0x58433c5 (0x7fef5133d3c5 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default7]:frame #33: _PyEval_EvalFrameDefault + 0x13ca (0x5558052f18fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:frame #42: _PyEval_EvalFrameDefault + 0x72c (0x56244cb22c5c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:frame #34: + 0x150582 (0x55580530c582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:frame #51: _PyFunction_Vectorcall + 0x6c (0x55f2231e4a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 44, in forward [default3]:frame #15: + 0x4e893cc (0x7fef509833cc in 
/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default6]: output = model(**micro_batch) [default4]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl [default7]:frame #52: _PyObject_FastCallDictTstate + 0x187 (0x55f2231dd007 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:frame #25: _PyFunction_Vectorcall + 0x6c (0x55add1289a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:frame #53: _PyObject_Call_Prepend + 0x69 (0x55f2231eec39 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:frame #26: PyObject_Call + 0xbc (0x55add1295f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl [default5]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl [default6]:frame #20: + 0x413ea4 (0x7f0b11dd4ea4 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_python.so) [default2]:frame #43: _PyFunction_Vectorcall + 0x6c (0x56244cb32a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]: return self._call_impl(*args, **kwargs) [default6]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl [default2]:frame #44: _PyEval_EvalFrameDefault + 0x13ca (0x56244cb238fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:frame #16: + 0x1a08a88 (0x7fef4d502a88 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 
[default0]:frame #27: _PyEval_EvalFrameDefault + 0x2d83 (0x55add127c2b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:frame #56: _PyEval_EvalFrameDefault + 0x2d83 (0x5610871622b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:frame #57: + 0x150582 (0x56108717b582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]: return self._call_impl(*args, **kwargs) [default7]:frame #35: _PyEval_EvalFrameDefault + 0x13ca (0x5558052f18fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:frame #45: + 0x150582 (0x56244cb3e582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:frame #28: _PyFunction_Vectorcall + 0x6c (0x55add1289a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:frame #29: _PyEval_EvalFrameDefault + 0x13ca (0x55add127a8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:frame #17: + 0x5849a84 (0x7fef51343a84 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default4]:frame #26: PyObject_Call + 0xbc (0x55be5ca8ff1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl [default6]: return self._call_impl(*args, **kwargs) [default6]:frame #21: + 0x1445a6 (0x5631d63c55a6 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:frame #30: + 0x150582 (0x55add1295582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:frame #18: + 0x584ed35 (0x7fef51348d35 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default0]:frame #31: _PyEval_EvalFrameDefault + 0x13ca (0x55add127a8fa in 
/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:frame #58: _PyEval_EvalFrameDefault + 0x13ca (0x5610871608fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:frame #19: + 0xc97eee (0x7fef63bfaeee in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_python.so) [default5]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl [default2]:frame #46: PyObject_Call + 0xbc (0x56244cb3ef1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:frame #32: + 0x150582 (0x55c605a0a582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:frame #33: _PyEval_EvalFrameDefault + 0x13ca (0x55c6059ef8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]: return forward_call(*args, **kwargs) [default4]: return forward_call(*args, **kwargs) [default4]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 764, in forward [default3]:frame #59: + 0x150582 (0x56108717b582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:frame #54: + 0x211239 (0x55f2232b1239 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]: return forward_call(*args, **kwargs) [default7]:frame #55: _PyObject_MakeTpCall + 0x26b (0x55f2231dda6b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:frame #20: + 0x413ea4 (0x7fef63376ea4 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_python.so) [default2]:frame #47: _PyEval_EvalFrameDefault + 0x2d83 (0x56244cb252b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:frame #22: _PyObject_MakeTpCall + 0x26b (0x5631d63bea6b in 
/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/block.py", line 126, in forward [default6]:frame #23: + 0x150866 (0x5631d63d1866 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]: new_kwargs[name] = recv_from_pipeline_state_buffer( [default2]:frame #34: + 0x150582 (0x55c605a0a582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:frame #35: _PyEval_EvalFrameDefault + 0x13ca (0x55c6059ef8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:frame #21: + 0x1445a6 (0x563dfa5155a6 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:frame #27: _PyEval_EvalFrameDefault + 0x2d83 (0x55be5ca762b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:frame #28: _PyFunction_Vectorcall + 0x6c (0x55be5ca83a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 764, in forward [default6]:frame #24: _PyEval_EvalFrameDefault + 0x4c12 (0x5631d63ba142 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:frame #25: _PyFunction_Vectorcall + 0x6c (0x5631d63c5a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:frame #36: _PyObject_FastCallDictTstate + 0xd0 (0x5558052f8f50 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:frame #22: _PyObject_MakeTpCall + 0x26b (0x563dfa50ea6b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:frame #37: _PyObject_Call_Prepend + 0x69 (0x55580530ac39 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]: return self.forward_with_hidden_states(input_ids=input_ids, input_mask=input_mask)[0] 
[default3]:frame #60: PyObject_Call + 0xbc (0x56108717bf1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:frame #61: _PyEval_EvalFrameDefault + 0x2d83 (0x5610871622b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:frame #56: _PyEval_EvalFrameDefault + 0x5723 (0x55f2231d9c53 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:frame #36: _PyObject_FastCallDictTstate + 0xd0 (0x55c6059f6f50 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:frame #48: + 0x150582 (0x56244cb3e582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:frame #23: + 0x150866 (0x563dfa521866 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:frame #38: + 0x211239 (0x5558053cd239 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:frame #26: PyObject_Call + 0xbc (0x5631d63d1f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:frame #27: _PyEval_EvalFrameDefault + 0x2d83 (0x5631d63b82b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:frame #39: _PyObject_MakeTpCall + 0x26b (0x5558052f9a6b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 786, in forward_with_hidden_states [default3]:frame #24: _PyEval_EvalFrameDefault + 0x4c12 (0x563dfa50a142 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl [default3]:frame #25: _PyFunction_Vectorcall + 0x6c (0x563dfa515a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:frame #37: _PyObject_Call_Prepend + 0x69 (0x55c605a08c39 in 
/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:frame #38: + 0x211239 (0x55c605acb239 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]: fp32_sharded_logits = self.cast_to_fp32(x=sharded_logits)["output"] [default7]:frame #40: _PyEval_EvalFrameDefault + 0x4eb6 (0x5558052f53e6 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:frame #49: PyObject_Call + 0xbc (0x56244cb3ef1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:frame #26: PyObject_Call + 0xbc (0x563dfa521f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]: return self.forward_with_hidden_states(input_ids=input_ids, input_mask=input_mask)[0] [default4]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 786, in forward_with_hidden_states [default7]:frame #41: _PyFunction_Vectorcall + 0x6c (0x555805300a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:frame #50: _PyEval_EvalFrameDefault + 0x2d83 (0x56244cb252b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:frame #57: + 0x150582 (0x55f2231f0582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:frame #58: _PyEval_EvalFrameDefault + 0x13ca (0x55f2231d58fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:frame #29: _PyEval_EvalFrameDefault + 0x13ca (0x55be5ca748fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]: fp32_sharded_logits = self.cast_to_fp32(x=sharded_logits)["output"] [default4]:frame #30: + 0x150582 (0x55be5ca8f582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:frame #32: + 0x150582 (0x55add1295582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]: File 
"/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl [default6]: return forward_call(*args, **kwargs) [default2]:frame #51: _PyFunction_Vectorcall + 0x6c (0x56244cb32a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 890, in forward [default5]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl [default3]:frame #27: _PyEval_EvalFrameDefault + 0x2d83 (0x563dfa5082b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:frame #59: + 0x150582 (0x55f2231f0582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]: sharded_logits = self.model( [default2]:frame #52: _PyObject_FastCallDictTstate + 0x187 (0x56244cb2b007 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:frame #42: _PyEval_EvalFrameDefault + 0x72c (0x5558052f0c5c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:frame #43: _PyFunction_Vectorcall + 0x6c (0x555805300a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:frame #53: _PyObject_Call_Prepend + 0x69 (0x56244cb3cc39 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]: return self._call_impl(*args, **kwargs) [default7]:frame #44: _PyEval_EvalFrameDefault + 0x13ca (0x5558052f18fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:frame #33: _PyEval_EvalFrameDefault + 0x13ca (0x55add127a8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:frame #34: + 0x150582 (0x55add1295582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]: return self._call_impl(*args, **kwargs) 
[default3]:frame #62: + 0x150582 (0x56108717b582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:frame #60: PyObject_Call + 0xbc (0x55f2231f0f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:frame #61: _PyEval_EvalFrameDefault + 0x2d83 (0x55f2231d72b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl [default5]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl [default5]: return forward_call(*args, **kwargs) [default2]:frame #39: _PyObject_MakeTpCall + 0x26b (0x55c6059f7a6b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:frame #35: _PyEval_EvalFrameDefault + 0x13ca (0x55add127a8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/functional.py", line 117, in recv_from_pipeline_state_buffer [default5]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/block.py", line 126, in forward [default5]: pipeline_state.run_communication() [default2]:frame #54: + 0x211239 (0x56244cbff239 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:frame #28: _PyFunction_Vectorcall + 0x6c (0x563dfa515a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:frame #31: _PyEval_EvalFrameDefault + 0x13ca (0x55be5ca748fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:frame #63: PyObject_Call + 0xbc (0x56108717bf1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:. 
This may indicate a possible application crash on rank 0 or a network set up issue. [default4]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl [default5]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 150, in run_communication [default7]:frame #62: + 0x150582 (0x55f2231f0582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:frame #63: PyObject_Call + 0xbc (0x55f2231f0f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:frame #29: _PyEval_EvalFrameDefault + 0x13ca (0x563dfa5068fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]: recv_activation_tensor = recv_activation() [default5]: new_kwargs[name] = recv_from_pipeline_state_buffer( [default5]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/functional.py", line 117, in recv_from_pipeline_state_buffer [default5]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 31, in __call__ [default2]:frame #55: _PyObject_MakeTpCall + 0x26b (0x56244cb2ba6b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:frame #56: _PyEval_EvalFrameDefault + 0x5723 (0x56244cb27c53 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:frame #45: + 0x150582 (0x55580530c582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:frame #36: _PyObject_FastCallDictTstate + 0xd0 (0x55add1281f50 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]: return forward_call(*args, **kwargs) [default5]: pipeline_state.run_communication() [default2]:frame #40: _PyEval_EvalFrameDefault + 0x4eb6 (0x55c6059f33e6 in 
/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:frame #41: _PyFunction_Vectorcall + 0x6c (0x55c6059fea2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:frame #57: + 0x150582 (0x56244cb3e582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]: return self._call_impl(*args, **kwargs) [default4]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/block.py", line 126, in forward [default3]:frame #30: + 0x150582 (0x563dfa521582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl [default4]:frame #32: + 0x150582 (0x55be5ca8f582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:frame #31: _PyEval_EvalFrameDefault + 0x13ca (0x563dfa5068fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:frame #58: _PyEval_EvalFrameDefault + 0x13ca (0x56244cb238fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:frame #42: _PyEval_EvalFrameDefault + 0x72c (0x55c6059eec5c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:frame #59: + 0x150582 (0x56244cb3e582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 150, in run_communication [default5]: recv_activation_tensor = recv_activation() [default7]:frame #46: PyObject_Call + 0xbc (0x55580530cf1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:frame #37: _PyObject_Call_Prepend + 0x69 (0x55add1293c39 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:frame #28: _PyFunction_Vectorcall + 0x6c 
(0x5631d63c5a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]: return self.p2p.recv_tensors(num_tensors=1, from_rank=self.from_rank)[0] [default6]:frame #29: _PyEval_EvalFrameDefault + 0x13ca (0x5631d63b68fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]: return forward_call(*args, **kwargs) [default5]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 31, in __call__ [default0]:frame #38: + 0x211239 (0x55add1356239 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:frame #39: _PyObject_MakeTpCall + 0x26b (0x55add1282a6b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:frame #32: + 0x150582 (0x563dfa521582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:frame #33: _PyEval_EvalFrameDefault + 0x13ca (0x563dfa5068fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:frame #40: _PyEval_EvalFrameDefault + 0x4eb6 (0x55add127e3e6 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:frame #33: _PyEval_EvalFrameDefault + 0x13ca (0x55be5ca748fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:frame #34: + 0x150582 (0x563dfa521582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:frame #60: PyObject_Call + 0xbc (0x56244cb3ef1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:frame #43: _PyFunction_Vectorcall + 0x6c (0x55c6059fea2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:. This may indicate a possible application crash on rank 0 or a network set up issue. 
[default0]:frame #41: _PyFunction_Vectorcall + 0x6c (0x55add1289a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:frame #35: _PyEval_EvalFrameDefault + 0x13ca (0x563dfa5068fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:frame #47: _PyEval_EvalFrameDefault + 0x2d83 (0x5558052f32b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]: new_kwargs[name] = recv_from_pipeline_state_buffer( [default5]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 353, in recv_tensors [default4]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/functional.py", line 117, in recv_from_pipeline_state_buffer [default5]: return self.p2p.recv_tensors(num_tensors=1, from_rank=self.from_rank)[0] [default2]:frame #61: _PyEval_EvalFrameDefault + 0x2d83 (0x56244cb252b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:frame #62: + 0x150582 (0x56244cb3e582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 764, in forward [default2]:frame #63: PyObject_Call + 0xbc (0x56244cb3ef1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]: return self.forward_with_hidden_states(input_ids=input_ids, input_mask=input_mask)[0] [default5]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 353, in recv_tensors [default7]:frame #48: + 0x150582 (0x55580530c582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:frame #36: _PyObject_FastCallDictTstate + 0xd0 (0x563dfa50df50 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:frame #37: _PyObject_Call_Prepend + 0x69 (0x563dfa51fc39 in 
/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:frame #49: PyObject_Call + 0xbc (0x55580530cf1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:frame #30: + 0x150582 (0x5631d63d1582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:frame #44: _PyEval_EvalFrameDefault + 0x13ca (0x55c6059ef8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:frame #31: _PyEval_EvalFrameDefault + 0x13ca (0x5631d63b68fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:frame #45: + 0x150582 (0x55c605a0a582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:frame #42: _PyEval_EvalFrameDefault + 0x72c (0x55add1279c5c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:frame #34: + 0x150582 (0x55be5ca8f582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:frame #38: + 0x211239 (0x563dfa5e2239 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:frame #50: _PyEval_EvalFrameDefault + 0x2d83 (0x5558052f32b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:frame #39: _PyObject_MakeTpCall + 0x26b (0x563dfa50ea6b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:frame #51: _PyFunction_Vectorcall + 0x6c (0x555805300a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:frame #40: _PyEval_EvalFrameDefault + 0x4eb6 (0x563dfa50a3e6 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]: buffers, futures = self.irecv_tensors(num_tensors=num_tensors, from_rank=from_rank, tag=tag) [default4]: pipeline_state.run_communication() [default4]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 150, in run_communication [default2]:frame #46: 
PyObject_Call + 0xbc (0x55c605a0af1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:frame #47: _PyEval_EvalFrameDefault + 0x2d83 (0x55c6059f12b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]: buffers, futures = self.irecv_tensors(num_tensors=num_tensors, from_rank=from_rank, tag=tag) [default6]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 780, in forward_with_hidden_states [default6]:frame #32: + 0x150582 (0x5631d63d1582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:frame #52: _PyObject_FastCallDictTstate + 0x187 (0x5558052f9007 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:frame #33: _PyEval_EvalFrameDefault + 0x13ca (0x5631d63b68fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:frame #53: _PyObject_Call_Prepend + 0x69 (0x55580530ac39 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:frame #41: _PyFunction_Vectorcall + 0x6c (0x563dfa515a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:frame #54: + 0x211239 (0x5558053cd239 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:frame #42: _PyEval_EvalFrameDefault + 0x72c (0x563dfa505c5c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]: hidden_encoder_states = encoder_block(**hidden_encoder_states) [default2]:. This may indicate a possible application crash on rank 0 or a network set up issue. 
[default6]:frame #34: + 0x150582 (0x5631d63d1582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]: recv_activation_tensor = recv_activation() [default4]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 31, in __call__ [default2]:frame #48: + 0x150582 (0x55c605a0a582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:frame #43: _PyFunction_Vectorcall + 0x6c (0x563dfa515a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:frame #49: PyObject_Call + 0xbc (0x55c605a0af1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:frame #43: _PyFunction_Vectorcall + 0x6c (0x55add1289a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:frame #44: _PyEval_EvalFrameDefault + 0x13ca (0x563dfa5068fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 326, in irecv_tensors [default6]:frame #35: _PyEval_EvalFrameDefault + 0x13ca (0x5631d63b68fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:frame #55: PyObject_Call + 0x207 (0x55580530d067 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:frame #45: + 0x150582 (0x563dfa521582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:frame #56: _PyEval_EvalFrameDefault + 0x2d83 (0x5558052f32b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 326, in irecv_tensors [default0]:frame #44: _PyEval_EvalFrameDefault + 0x13ca (0x55add127a8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:frame #45: + 0x150582 
(0x55add1295582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]: meta = self._recv_meta(from_rank=from_rank, tag=tag) [default6]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl [default5]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 246, in _recv_meta [default4]: return self.p2p.recv_tensors(num_tensors=1, from_rank=self.from_rank)[0] [default4]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 353, in recv_tensors [default6]: return self._call_impl(*args, **kwargs) [default4]: buffers, futures = self.irecv_tensors(num_tensors=num_tensors, from_rank=from_rank, tag=tag) [default6]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl [default6]:frame #36: _PyObject_FastCallDictTstate + 0xd0 (0x5631d63bdf50 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]: meta = self._recv_meta(from_rank=from_rank, tag=tag) [default4]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 326, in irecv_tensors [default5]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 246, in _recv_meta [default6]:frame #37: _PyObject_Call_Prepend + 0x69 (0x5631d63cfc39 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:frame #50: _PyEval_EvalFrameDefault + 0x2d83 (0x55c6059f12b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]: meta = self._recv_meta(from_rank=from_rank, tag=tag) [default5]: dist.recv( [default2]:frame #51: _PyFunction_Vectorcall + 0x6c (0x55c6059fea2c in 
/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:frame #46: PyObject_Call + 0xbc (0x55add1295f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 72, in wrapper [default6]: return forward_call(*args, **kwargs) [default0]:frame #47: _PyEval_EvalFrameDefault + 0x2d83 (0x55add127c2b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:frame #48: + 0x150582 (0x55add1295582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:frame #38: + 0x211239 (0x5631d6492239 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:frame #57: + 0x150582 (0x55580530c582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]: return func(*args, **kwargs) [default7]:frame #58: _PyEval_EvalFrameDefault + 0x13ca (0x5558052f18fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 246, in _recv_meta [default5]: dist.recv( [default3]:frame #46: PyObject_Call + 0xbc (0x563dfa521f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:frame #47: _PyEval_EvalFrameDefault + 0x2d83 (0x563dfa5082b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:frame #49: PyObject_Call + 0xbc (0x55add1295f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:frame #50: _PyEval_EvalFrameDefault + 0x2d83 (0x55add127c2b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]: dist.recv( [default6]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/block.py", line 126, in forward [default4]:frame #35: 
_PyEval_EvalFrameDefault + 0x13ca (0x55be5ca748fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:frame #39: _PyObject_MakeTpCall + 0x26b (0x5631d63bea6b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:frame #36: _PyObject_FastCallDictTstate + 0xd0 (0x55be5ca7bf50 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:frame #37: _PyObject_Call_Prepend + 0x69 (0x55be5ca8dc39 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:frame #40: _PyEval_EvalFrameDefault + 0x4eb6 (0x5631d63ba3e6 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:frame #59: + 0x150582 (0x55580530c582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:frame #48: + 0x150582 (0x563dfa521582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:frame #60: PyObject_Call + 0xbc (0x55580530cf1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 72, in wrapper [default4]: return func(*args, **kwargs) [default4]:frame #38: + 0x211239 (0x55be5cb50239 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:frame #39: _PyObject_MakeTpCall + 0x26b (0x55be5ca7ca6b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1706, in recv [default6]:frame #41: _PyFunction_Vectorcall + 0x6c (0x5631d63c5a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:frame #52: _PyObject_FastCallDictTstate + 0x187 (0x55c6059f7007 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:frame #51: _PyFunction_Vectorcall + 
0x6c (0x55add1289a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:frame #42: _PyEval_EvalFrameDefault + 0x72c (0x5631d63b5c5c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 72, in wrapper [default4]: pg.recv([tensor], group_src_rank, tag).wait() [default3]:frame #49: PyObject_Call + 0xbc (0x563dfa521f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:frame #43: _PyFunction_Vectorcall + 0x6c (0x5631d63c5a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:frame #44: _PyEval_EvalFrameDefault + 0x13ca (0x5631d63b68fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:frame #40: _PyEval_EvalFrameDefault + 0x4eb6 (0x55be5ca783e6 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:frame #41: _PyFunction_Vectorcall + 0x6c (0x55be5ca83a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1706, in recv [default4]:frame #42: _PyEval_EvalFrameDefault + 0x72c (0x55be5ca73c5c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:frame #61: _PyEval_EvalFrameDefault + 0x2d83 (0x5558052f32b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]: return func(*args, **kwargs) [default3]:frame #50: _PyEval_EvalFrameDefault + 0x2d83 (0x563dfa5082b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1706, in recv [default5]: pg.recv([tensor], group_src_rank, tag).wait() [default5]: 
pg.recv([tensor], group_src_rank, tag).wait() [default4]:torch.distributed.DistBackendError: [15] is setting up NCCL communicator and retrieving ncclUniqueId from [0] via c10d key-value store by key '14:15', but store->get('14:15') got error: Connection reset by peer [default3]:frame #51: _PyFunction_Vectorcall + 0x6c (0x563dfa515a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:frame #43: _PyFunction_Vectorcall + 0x6c (0x55be5ca83a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:frame #52: _PyObject_FastCallDictTstate + 0x187 (0x55add1282007 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:frame #52: _PyObject_FastCallDictTstate + 0x187 (0x563dfa50e007 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:frame #53: _PyObject_Call_Prepend + 0x69 (0x55c605a08c39 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:torch.distributed.DistBackendError: [15] is setting up NCCL communicator and retrieving ncclUniqueId from [0] via c10d key-value store by key '14:15', but store->get('14:15') got error: Connection reset by peer [default7]:frame #62: + 0x150582 (0x55580530c582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:frame #53: _PyObject_Call_Prepend + 0x69 (0x55add1293c39 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:frame #45: + 0x150582 (0x5631d63d1582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:frame #44: _PyEval_EvalFrameDefault + 0x13ca (0x55be5ca748fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:frame #46: PyObject_Call + 0xbc (0x5631d63d1f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:Exception raised from recvBytes at ../torch/csrc/distributed/c10d/Utils.hpp:670 (most recent call first): [default6]: new_kwargs[name] = 
recv_from_pipeline_state_buffer( [default5]:Exception raised from recvBytes at ../torch/csrc/distributed/c10d/Utils.hpp:670 (most recent call first): [default6]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/functional.py", line 117, in recv_from_pipeline_state_buffer [default6]:frame #47: _PyEval_EvalFrameDefault + 0x2d83 (0x5631d63b82b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:frame #45: + 0x150582 (0x55be5ca8f582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:frame #53: _PyObject_Call_Prepend + 0x69 (0x563dfa51fc39 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:frame #54: + 0x211239 (0x563dfa5e2239 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:frame #46: PyObject_Call + 0xbc (0x55be5ca8ff1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:frame #54: + 0x211239 (0x55add1356239 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7fc513009d87 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default0]:frame #55: PyObject_Call + 0x207 (0x55add1296067 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:frame #48: + 0x150582 (0x5631d63d1582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:frame #63: PyObject_Call + 0xbc (0x55580530cf1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:frame #49: PyObject_Call + 0xbc (0x5631d63d1f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:. This may indicate a possible application crash on rank 0 or a network set up issue. 
[default5]:frame #1: + 0x589518e (0x7fc54afc318e in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default3]:frame #55: _PyObject_MakeTpCall + 0x26b (0x563dfa50ea6b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:frame #54: + 0x211239 (0x55c605acb239 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:frame #55: PyObject_Call + 0x207 (0x55c605a0b067 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:frame #56: _PyEval_EvalFrameDefault + 0x5723 (0x563dfa50ac53 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:frame #56: _PyEval_EvalFrameDefault + 0x2d83 (0x55c6059f12b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7fecf45c6d87 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default0]:frame #56: _PyEval_EvalFrameDefault + 0x2d83 (0x55add127c2b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:frame #57: + 0x150582 (0x563dfa521582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:frame #57: + 0x150582 (0x55c605a0a582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:frame #58: _PyEval_EvalFrameDefault + 0x13ca (0x55c6059ef8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:frame #58: _PyEval_EvalFrameDefault + 0x13ca (0x563dfa5068fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:frame #57: + 0x150582 (0x55add1295582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:frame #50: _PyEval_EvalFrameDefault + 0x2d83 (0x5631d63b82b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]: 
pipeline_state.run_communication() [default6]:frame #51: _PyFunction_Vectorcall + 0x6c (0x5631d63c5a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:frame #59: + 0x150582 (0x55c605a0a582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:frame #2: c10d::TCPStore::doWait(c10::ArrayRef, std::chrono::duration >) + 0x360 (0x7fc54afbd9a0 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default2]:frame #60: PyObject_Call + 0xbc (0x55c605a0af1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:frame #58: _PyEval_EvalFrameDefault + 0x13ca (0x55add127a8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:frame #3: c10d::TCPStore::doGet(std::string const&) + 0x32 (0x7fc54afbdce2 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default0]:frame #59: + 0x150582 (0x55add1295582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 150, in run_communication [default4]:frame #1: + 0x589518e (0x7fed2c58018e in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default6]: recv_activation_tensor = recv_activation() [default6]:frame #52: _PyObject_FastCallDictTstate + 0x187 (0x5631d63be007 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:frame #47: _PyEval_EvalFrameDefault + 0x2d83 (0x55be5ca762b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:frame #48: + 0x150582 (0x55be5ca8f582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:frame #53: _PyObject_Call_Prepend + 0x69 (0x5631d63cfc39 in 
/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:frame #49: PyObject_Call + 0xbc (0x55be5ca8ff1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 31, in __call__ [default5]:torch.distributed.DistBackendError: [5] is setting up NCCL communicator and retrieving ncclUniqueId from [0] via c10d key-value store by key '4:5', but store->get('4:5') got error: Connection reset by peer [default4]:frame #2: c10d::TCPStore::doWait(c10::ArrayRef, std::chrono::duration >) + 0x360 (0x7fed2c57a9a0 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default4]:frame #50: _PyEval_EvalFrameDefault + 0x2d83 (0x55be5ca762b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:frame #51: _PyFunction_Vectorcall + 0x6c (0x55be5ca83a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:frame #3: c10d::TCPStore::doGet(std::string const&) + 0x32 (0x7fed2c57ace2 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default4]:frame #52: _PyObject_FastCallDictTstate + 0x187 (0x55be5ca7c007 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:frame #54: + 0x211239 (0x5631d6492239 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:frame #53: _PyObject_Call_Prepend + 0x69 (0x55be5ca8dc39 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:frame #55: _PyObject_MakeTpCall + 0x26b (0x5631d63bea6b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]: return self.p2p.recv_tensors(num_tensors=1, from_rank=self.from_rank)[0] [default4]:frame #4: c10d::TCPStore::get(std::string const&) + 0xa1 (0x7fed2c57bb11 in 
/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default6]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 353, in recv_tensors [default4]:frame #5: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7fed2c530f81 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default2]:frame #61: _PyEval_EvalFrameDefault + 0x2d83 (0x55c6059f12b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:frame #56: _PyEval_EvalFrameDefault + 0x5723 (0x5631d63bac53 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:frame #57: + 0x150582 (0x5631d63d1582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:frame #60: PyObject_Call + 0xbc (0x55add1295f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:frame #6: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7fed2c530f81 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default0]:frame #61: _PyEval_EvalFrameDefault + 0x2d83 (0x55add127c2b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:frame #59: + 0x150582 (0x563dfa521582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]: buffers, futures = self.irecv_tensors(num_tensors=num_tensors, from_rank=from_rank, tag=tag) [default3]:frame #60: PyObject_Call + 0xbc (0x563dfa521f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 326, in irecv_tensors [default4]:frame #7: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7fed2c530f81 in 
/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default2]:frame #62: + 0x150582 (0x55c605a0a582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:frame #8: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7fed2c530f81 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default5]:Exception raised from recvBytes at ../torch/csrc/distributed/c10d/Utils.hpp:670 (most recent call first): [default3]:frame #61: _PyEval_EvalFrameDefault + 0x2d83 (0x563dfa5082b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:frame #62: + 0x150582 (0x563dfa521582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f2784a33d87 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default3]:frame #63: PyObject_Call + 0xbc (0x563dfa521f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:frame #9: c10d::ProcessGroupNCCL::broadcastUniqueNCCLID(ncclUniqueId*, bool, std::string const&, int) + 0xa9 (0x7fecf576ec69 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default2]:frame #63: PyObject_Call + 0xbc (0x55c605a0af1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:frame #54: + 0x211239 (0x55be5cb50239 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:frame #55: PyObject_Call + 0x207 (0x55be5ca90067 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:frame #4: c10d::TCPStore::get(std::string const&) + 0xa1 (0x7fc54afbeb11 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default2]:. 
This may indicate a possible application crash on rank 0 or a network set up issue. [default6]:frame #58: _PyEval_EvalFrameDefault + 0x13ca (0x5631d63b68fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:frame #62: + 0x150582 (0x55add1295582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:frame #59: + 0x150582 (0x5631d63d1582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]: meta = self._recv_meta(from_rank=from_rank, tag=tag) [default3]:. This may indicate a possible application crash on rank 0 or a network set up issue. [default5]:frame #5: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7fc54af73f81 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default0]:frame #63: PyObject_Call + 0xbc (0x55add1295f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:. This may indicate a possible application crash on rank 0 or a network set up issue. 
[default5]:frame #6: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7fc54af73f81 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default6]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 246, in _recv_meta [default5]:frame #7: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7fc54af73f81 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default6]: dist.recv( [default4]:frame #10: c10d::ProcessGroupNCCL::getNCCLComm(std::string const&, std::vector > const&, c10d::OpType, int, bool) + 0x22b (0x7fecf5775c5b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default4]:frame #56: _PyEval_EvalFrameDefault + 0x2d83 (0x55be5ca762b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:frame #57: + 0x150582 (0x55be5ca8f582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:frame #60: PyObject_Call + 0xbc (0x5631d63d1f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:frame #61: _PyEval_EvalFrameDefault + 0x2d83 (0x5631d63b82b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:frame #1: + 0x589518e (0x7f27bc9ed18e in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default6]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 72, in wrapper [default4]:frame #58: _PyEval_EvalFrameDefault + 0x13ca (0x55be5ca748fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:frame #59: + 0x150582 (0x55be5ca8f582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:frame #8: c10d::PrefixStore::get(std::string 
const&) + 0x31 (0x7fc54af73f81 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default6]: return func(*args, **kwargs) [default5]:frame #9: c10d::ProcessGroupNCCL::broadcastUniqueNCCLID(ncclUniqueId*, bool, std::string const&, int) + 0xa9 (0x7fc5141b1c69 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default6]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1706, in recv [default4]:frame #11: c10d::ProcessGroupNCCL::recv(std::vector >&, int, int) + 0x550 (0x7fecf5798b60 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default5]:frame #2: c10d::TCPStore::doWait(c10::ArrayRef, std::chrono::duration >) + 0x360 (0x7f27bc9e79a0 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default5]:frame #10: c10d::ProcessGroupNCCL::getNCCLComm(std::string const&, std::vector > const&, c10d::OpType, int, bool) + 0x22b (0x7fc5141b8c5b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default5]:frame #11: c10d::ProcessGroupNCCL::recv(std::vector >&, int, int) + 0x550 (0x7fc5141dbb60 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default5]:frame #3: c10d::TCPStore::doGet(std::string const&) + 0x32 (0x7f27bc9e7ce2 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default4]:frame #60: PyObject_Call + 0xbc (0x55be5ca8ff1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:frame #61: _PyEval_EvalFrameDefault + 0x2d83 (0x55be5ca762b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:frame #12: + 0x5838439 
(0x7fed2c523439 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default6]: pg.recv([tensor], group_src_rank, tag).wait() [default4]:frame #13: + 0x5843330 (0x7fed2c52e330 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default4]:frame #62: + 0x150582 (0x55be5ca8f582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:frame #63: PyObject_Call + 0xbc (0x55be5ca8ff1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:frame #12: + 0x5838439 (0x7fc54af66439 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default5]:frame #4: c10d::TCPStore::get(std::string const&) + 0xa1 (0x7f27bc9e8b11 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default6]:torch.distributed.DistBackendError: [5] is setting up NCCL communicator and retrieving ncclUniqueId from [0] via c10d key-value store by key '4:5', but store->get('4:5') got error: Connection reset by peer [default4]:frame #14: + 0x58433c5 (0x7fed2c52e3c5 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default6]:Exception raised from recvBytes at ../torch/csrc/distributed/c10d/Utils.hpp:670 (most recent call first): [default4]:frame #15: + 0x4e893cc (0x7fed2bb743cc in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default4]:. This may indicate a possible application crash on rank 0 or a network set up issue. 
[default6]:frame #62: + 0x150582 (0x5631d63d1582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7fe480657d87 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default6]:frame #63: PyObject_Call + 0xbc (0x5631d63d1f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:frame #1: + 0x589518e (0x7fe4b861118e in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default6]:. This may indicate a possible application crash on rank 0 or a network set up issue. [default5]:frame #5: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f27bc99df81 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default5]:frame #13: + 0x5843330 (0x7fc54af71330 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default5]:frame #6: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f27bc99df81 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default5]:frame #14: + 0x58433c5 (0x7fc54af713c5 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default4]:frame #16: + 0x1a08a88 (0x7fed286f3a88 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default4]:frame #17: + 0x5849a84 (0x7fed2c534a84 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default5]:frame #7: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f27bc99df81 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default5]:frame #15: + 0x4e893cc (0x7fc54a5b73cc in 
/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default5]:frame #8: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f27bc99df81 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default5]:frame #16: + 0x1a08a88 (0x7fc547136a88 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default5]:frame #9: c10d::ProcessGroupNCCL::broadcastUniqueNCCLID(ncclUniqueId*, bool, std::string const&, int) + 0xa9 (0x7f2785bdbc69 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default4]:frame #18: + 0x584ed35 (0x7fed2c539d35 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default6]:frame #2: c10d::TCPStore::doWait(c10::ArrayRef, std::chrono::duration >) + 0x360 (0x7fe4b860b9a0 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default6]:frame #3: c10d::TCPStore::doGet(std::string const&) + 0x32 (0x7fe4b860bce2 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default5]:frame #17: + 0x5849a84 (0x7fc54af77a84 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default6]:frame #4: c10d::TCPStore::get(std::string const&) + 0xa1 (0x7fe4b860cb11 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default4]:frame #19: + 0xc97eee (0x7fed3edebeee in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_python.so) [default5]:frame #10: c10d::ProcessGroupNCCL::getNCCLComm(std::string const&, std::vector > const&, c10d::OpType, int, bool) + 0x22b (0x7f2785be2c5b in 
/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default5]:frame #18: + 0x584ed35 (0x7fc54af7cd35 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default6]:frame #5: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7fe4b85c1f81 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default4]:frame #20: + 0x413ea4 (0x7fed3e567ea4 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_python.so) [default5]:frame #11: c10d::ProcessGroupNCCL::recv(std::vector >&, int, int) + 0x550 (0x7f2785c05b60 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default5]:frame #12: + 0x5838439 (0x7f27bc990439 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default5]:frame #13: + 0x5843330 (0x7f27bc99b330 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default5]:frame #19: + 0xc97eee (0x7fc55d82eeee in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_python.so) [default5]:frame #14: + 0x58433c5 (0x7f27bc99b3c5 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default5]:frame #15: + 0x4e893cc (0x7f27bbfe13cc in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default4]:frame #21: + 0x1445a6 (0x5564518555a6 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:frame #6: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7fe4b85c1f81 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default4]:frame #22: 
_PyObject_MakeTpCall + 0x26b (0x55645184ea6b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:frame #23: + 0x150866 (0x556451861866 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:frame #7: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7fe4b85c1f81 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default5]:frame #16: + 0x1a08a88 (0x7f27b8b60a88 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default5]:frame #17: + 0x5849a84 (0x7f27bc9a1a84 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default4]:frame #24: _PyEval_EvalFrameDefault + 0x4c12 (0x55645184a142 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:frame #25: _PyFunction_Vectorcall + 0x6c (0x556451855a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:frame #8: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7fe4b85c1f81 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default6]:frame #9: c10d::ProcessGroupNCCL::broadcastUniqueNCCLID(ncclUniqueId*, bool, std::string const&, int) + 0xa9 (0x7fe4817ffc69 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default4]:frame #26: PyObject_Call + 0xbc (0x556451861f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:frame #10: c10d::ProcessGroupNCCL::getNCCLComm(std::string const&, std::vector > const&, c10d::OpType, int, bool) + 0x22b (0x7fe481806c5b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default5]:frame #20: + 0x413ea4 (0x7fc55cfaaea4 in 
/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_python.so) [default6]:frame #11: c10d::ProcessGroupNCCL::recv(std::vector >&, int, int) + 0x550 (0x7fe481829b60 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default6]:frame #12: + 0x5838439 (0x7fe4b85b4439 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default4]:frame #27: _PyEval_EvalFrameDefault + 0x2d83 (0x5564518482b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:frame #13: + 0x5843330 (0x7fe4b85bf330 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default5]:frame #21: + 0x1445a6 (0x55907cb035a6 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:frame #18: + 0x584ed35 (0x7f27bc9a6d35 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default4]:frame #28: _PyFunction_Vectorcall + 0x6c (0x556451855a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:frame #29: _PyEval_EvalFrameDefault + 0x13ca (0x5564518468fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:frame #14: + 0x58433c5 (0x7fe4b85bf3c5 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default4]:frame #30: + 0x150582 (0x556451861582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:frame #19: + 0xc97eee (0x7f27cf258eee in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_python.so) [default4]:frame #31: _PyEval_EvalFrameDefault + 0x13ca (0x5564518468fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:frame #32: + 0x150582 (0x556451861582 in 
/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:frame #15: + 0x4e893cc (0x7fe4b7c053cc in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default5]:frame #22: _PyObject_MakeTpCall + 0x26b (0x55907cafca6b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:frame #16: + 0x1a08a88 (0x7fe4b4784a88 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default6]:frame #17: + 0x5849a84 (0x7fe4b85c5a84 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default4]:frame #33: _PyEval_EvalFrameDefault + 0x13ca (0x5564518468fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:frame #20: + 0x413ea4 (0x7f27ce9d4ea4 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_python.so) [default5]:frame #23: + 0x150866 (0x55907cb0f866 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:frame #21: + 0x1445a6 (0x5599797475a6 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:frame #24: _PyEval_EvalFrameDefault + 0x4c12 (0x55907caf8142 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:frame #25: _PyFunction_Vectorcall + 0x6c (0x55907cb03a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:frame #22: _PyObject_MakeTpCall + 0x26b (0x559979740a6b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:frame #34: + 0x150582 (0x556451861582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:frame #23: + 0x150866 (0x559979753866 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:frame #24: _PyEval_EvalFrameDefault + 0x4c12 (0x55997973c142 in 
/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:frame #26: PyObject_Call + 0xbc (0x55907cb0ff1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:frame #25: _PyFunction_Vectorcall + 0x6c (0x559979747a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:frame #27: _PyEval_EvalFrameDefault + 0x2d83 (0x55907caf62b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:frame #28: _PyFunction_Vectorcall + 0x6c (0x55907cb03a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:frame #26: PyObject_Call + 0xbc (0x559979753f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:frame #27: _PyEval_EvalFrameDefault + 0x2d83 (0x55997973a2b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:frame #35: _PyEval_EvalFrameDefault + 0x13ca (0x5564518468fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:frame #28: _PyFunction_Vectorcall + 0x6c (0x559979747a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:frame #29: _PyEval_EvalFrameDefault + 0x13ca (0x55907caf48fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:frame #18: + 0x584ed35 (0x7fe4b85cad35 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default5]:frame #30: + 0x150582 (0x55907cb0f582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:frame #36: _PyObject_FastCallDictTstate + 0xd0 (0x55645184df50 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:frame #37: _PyObject_Call_Prepend + 0x69 (0x55645185fc39 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:frame #29: _PyEval_EvalFrameDefault + 0x13ca (0x5599797388fa in 
/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:frame #38: + 0x211239 (0x556451922239 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:frame #30: + 0x150582 (0x559979753582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:frame #39: _PyObject_MakeTpCall + 0x26b (0x55645184ea6b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:frame #31: _PyEval_EvalFrameDefault + 0x13ca (0x55907caf48fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:frame #31: _PyEval_EvalFrameDefault + 0x13ca (0x5599797388fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:frame #32: + 0x150582 (0x559979753582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:frame #33: _PyEval_EvalFrameDefault + 0x13ca (0x5599797388fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:frame #40: _PyEval_EvalFrameDefault + 0x4eb6 (0x55645184a3e6 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:frame #41: _PyFunction_Vectorcall + 0x6c (0x556451855a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:frame #34: + 0x150582 (0x559979753582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:frame #42: _PyEval_EvalFrameDefault + 0x72c (0x556451845c5c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:frame #32: + 0x150582 (0x55907cb0f582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:frame #33: _PyEval_EvalFrameDefault + 0x13ca (0x55907caf48fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:frame #35: _PyEval_EvalFrameDefault + 0x13ca (0x5599797388fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:frame #36: _PyObject_FastCallDictTstate + 
0xd0 (0x55997973ff50 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:frame #34: + 0x150582 (0x55907cb0f582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:frame #35: _PyEval_EvalFrameDefault + 0x13ca (0x55907caf48fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:frame #37: _PyObject_Call_Prepend + 0x69 (0x559979751c39 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:frame #38: + 0x211239 (0x559979814239 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:frame #43: _PyFunction_Vectorcall + 0x6c (0x556451855a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:frame #19: + 0xc97eee (0x7fe4cae7ceee in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_python.so) [default6]:frame #20: + 0x413ea4 (0x7fe4ca5f8ea4 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_python.so) [default5]:frame #36: _PyObject_FastCallDictTstate + 0xd0 (0x55907cafbf50 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:frame #21: + 0x1445a6 (0x56200030c5a6 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:frame #22: _PyObject_MakeTpCall + 0x26b (0x562000305a6b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:frame #37: _PyObject_Call_Prepend + 0x69 (0x55907cb0dc39 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:frame #38: + 0x211239 (0x55907cbd0239 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:frame #23: + 0x150866 (0x562000318866 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:frame #44: _PyEval_EvalFrameDefault + 0x13ca (0x5564518468fa in 
/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:frame #45: + 0x150582 (0x556451861582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:frame #46: PyObject_Call + 0xbc (0x556451861f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:frame #39: _PyObject_MakeTpCall + 0x26b (0x559979740a6b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:frame #47: _PyEval_EvalFrameDefault + 0x2d83 (0x5564518482b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:frame #40: _PyEval_EvalFrameDefault + 0x4eb6 (0x55997973c3e6 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:frame #41: _PyFunction_Vectorcall + 0x6c (0x559979747a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:frame #39: _PyObject_MakeTpCall + 0x26b (0x55907cafca6b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:frame #24: _PyEval_EvalFrameDefault + 0x4c12 (0x562000301142 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:frame #48: + 0x150582 (0x556451861582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:frame #25: _PyFunction_Vectorcall + 0x6c (0x56200030ca2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:frame #40: _PyEval_EvalFrameDefault + 0x4eb6 (0x55907caf83e6 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:frame #26: PyObject_Call + 0xbc (0x562000318f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:frame #27: _PyEval_EvalFrameDefault + 0x2d83 (0x5620002ff2b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:frame #49: PyObject_Call + 0xbc (0x556451861f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:frame #42: 
_PyEval_EvalFrameDefault + 0x72c (0x559979737c5c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:frame #43: _PyFunction_Vectorcall + 0x6c (0x559979747a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:frame #44: _PyEval_EvalFrameDefault + 0x13ca (0x5599797388fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:frame #41: _PyFunction_Vectorcall + 0x6c (0x55907cb03a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:frame #45: + 0x150582 (0x559979753582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:frame #50: _PyEval_EvalFrameDefault + 0x2d83 (0x5564518482b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:frame #28: _PyFunction_Vectorcall + 0x6c (0x56200030ca2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:frame #51: _PyFunction_Vectorcall + 0x6c (0x556451855a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:frame #29: _PyEval_EvalFrameDefault + 0x13ca (0x5620002fd8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:frame #46: PyObject_Call + 0xbc (0x559979753f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:frame #42: _PyEval_EvalFrameDefault + 0x72c (0x55907caf3c5c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:frame #30: + 0x150582 (0x562000318582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:frame #31: _PyEval_EvalFrameDefault + 0x13ca (0x5620002fd8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:frame #52: _PyObject_FastCallDictTstate + 0x187 (0x55645184e007 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:frame #47: _PyEval_EvalFrameDefault + 0x2d83 (0x55997973a2b3 in 
/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:frame #43: _PyFunction_Vectorcall + 0x6c (0x55907cb03a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:frame #32: + 0x150582 (0x562000318582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:frame #53: _PyObject_Call_Prepend + 0x69 (0x55645185fc39 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:frame #54: + 0x211239 (0x556451922239 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:frame #33: _PyEval_EvalFrameDefault + 0x13ca (0x5620002fd8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:frame #34: + 0x150582 (0x562000318582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:frame #55: _PyObject_MakeTpCall + 0x26b (0x55645184ea6b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:frame #44: _PyEval_EvalFrameDefault + 0x13ca (0x55907caf48fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:frame #48: + 0x150582 (0x559979753582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:frame #56: _PyEval_EvalFrameDefault + 0x5723 (0x55645184ac53 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:frame #35: _PyEval_EvalFrameDefault + 0x13ca (0x5620002fd8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:frame #57: + 0x150582 (0x556451861582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:frame #49: PyObject_Call + 0xbc (0x559979753f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:frame #45: + 0x150582 (0x55907cb0f582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:frame #36: _PyObject_FastCallDictTstate + 0xd0 (0x562000304f50 in 
/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:frame #58: _PyEval_EvalFrameDefault + 0x13ca (0x5564518468fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:frame #50: _PyEval_EvalFrameDefault + 0x2d83 (0x55997973a2b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:frame #46: PyObject_Call + 0xbc (0x55907cb0ff1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:frame #37: _PyObject_Call_Prepend + 0x69 (0x562000316c39 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:frame #59: + 0x150582 (0x556451861582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:frame #38: + 0x211239 (0x5620003d9239 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:frame #47: _PyEval_EvalFrameDefault + 0x2d83 (0x55907caf62b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:frame #39: _PyObject_MakeTpCall + 0x26b (0x562000305a6b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:frame #40: _PyEval_EvalFrameDefault + 0x4eb6 (0x5620003013e6 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:frame #60: PyObject_Call + 0xbc (0x556451861f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:frame #41: _PyFunction_Vectorcall + 0x6c (0x56200030ca2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:frame #61: _PyEval_EvalFrameDefault + 0x2d83 (0x5564518482b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:frame #51: _PyFunction_Vectorcall + 0x6c (0x559979747a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:frame #62: + 0x150582 (0x556451861582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:frame #48: + 0x150582 
(0x55907cb0f582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:frame #49: PyObject_Call + 0xbc (0x55907cb0ff1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:frame #42: _PyEval_EvalFrameDefault + 0x72c (0x5620002fcc5c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:frame #63: PyObject_Call + 0xbc (0x556451861f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:frame #52: _PyObject_FastCallDictTstate + 0x187 (0x559979740007 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:. This may indicate a possible application crash on rank 0 or a network set up issue. [default5]:frame #50: _PyEval_EvalFrameDefault + 0x2d83 (0x55907caf62b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:frame #43: _PyFunction_Vectorcall + 0x6c (0x56200030ca2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:frame #51: _PyFunction_Vectorcall + 0x6c (0x55907cb03a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:frame #53: _PyObject_Call_Prepend + 0x69 (0x559979751c39 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:frame #52: _PyObject_FastCallDictTstate + 0x187 (0x55907cafc007 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:frame #53: _PyObject_Call_Prepend + 0x69 (0x55907cb0dc39 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:frame #54: + 0x211239 (0x55907cbd0239 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:frame #44: _PyEval_EvalFrameDefault + 0x13ca (0x5620002fd8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:frame #55: _PyObject_MakeTpCall + 0x26b (0x55907cafca6b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:frame #56: 
_PyEval_EvalFrameDefault + 0x5723 (0x55907caf8c53 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:frame #57: + 0x150582 (0x55907cb0f582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:frame #54: + 0x211239 (0x559979814239 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:frame #58: _PyEval_EvalFrameDefault + 0x13ca (0x55907caf48fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:frame #59: + 0x150582 (0x55907cb0f582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:frame #60: PyObject_Call + 0xbc (0x55907cb0ff1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:frame #61: _PyEval_EvalFrameDefault + 0x2d83 (0x55907caf62b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:frame #55: PyObject_Call + 0x207 (0x559979754067 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:frame #62: + 0x150582 (0x55907cb0f582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:frame #63: PyObject_Call + 0xbc (0x55907cb0ff1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:. This may indicate a possible application crash on rank 0 or a network set up issue. 
[default5]:frame #56: _PyEval_EvalFrameDefault + 0x2d83 (0x55997973a2b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:frame #57: + 0x150582 (0x559979753582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:frame #58: _PyEval_EvalFrameDefault + 0x13ca (0x5599797388fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:frame #15: + 0x4e893cc (0x7f5368cd03cc in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default6]:frame #45: + 0x150582 (0x562000318582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 353, in recv_tensors [default5]:frame #59: + 0x150582 (0x559979753582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:frame #44: _PyEval_EvalFrameDefault + 0x13ca (0x5577f18d18fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:frame #1: + 0x589518e (0x7fc404c4d18e in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default5]:frame #60: PyObject_Call + 0xbc (0x559979753f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:frame #61: _PyEval_EvalFrameDefault + 0x2d83 (0x55997973a2b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:frame #51: _PyFunction_Vectorcall + 0x6c (0x55a04ba9ca2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:frame #46: PyObject_Call + 0xbc (0x562000318f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]: return self.p2p.recv_tensors(num_tensors=1, from_rank=self.from_rank)[0] [default5]:frame #62: + 0x150582 (0x559979753582 in 
/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:frame #63: PyObject_Call + 0xbc (0x559979753f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f1655353d87 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default6]:frame #47: _PyEval_EvalFrameDefault + 0x2d83 (0x5620002ff2b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1706, in recv [default5]:. This may indicate a possible application crash on rank 0 or a network set up issue. [default0]: buffers, futures = self.irecv_tensors(num_tensors=num_tensors, from_rank=from_rank, tag=tag) [default6]:frame #48: + 0x150582 (0x562000318582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:frame #49: PyObject_Call + 0xbc (0x562000318f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:frame #50: _PyEval_EvalFrameDefault + 0x2d83 (0x5620002ff2b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:frame #45: + 0x150582 (0x5577f18ec582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:frame #51: _PyFunction_Vectorcall + 0x6c (0x56200030ca2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:frame #52: _PyObject_FastCallDictTstate + 0x187 (0x562000305007 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:frame #53: _PyObject_Call_Prepend + 0x69 (0x562000316c39 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:frame #2: c10d::TCPStore::doWait(c10::ArrayRef, std::chrono::duration >) + 0x360 (0x7fc404c479a0 in 
/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default6]:frame #54: + 0x211239 (0x5620003d9239 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:frame #55: PyObject_Call + 0x207 (0x562000319067 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:frame #56: _PyEval_EvalFrameDefault + 0x2d83 (0x5620002ff2b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:frame #52: _PyObject_FastCallDictTstate + 0x187 (0x55a04ba95007 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:frame #57: + 0x150582 (0x562000318582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:frame #58: _PyEval_EvalFrameDefault + 0x13ca (0x5620002fd8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:frame #59: + 0x150582 (0x562000318582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:frame #60: PyObject_Call + 0xbc (0x562000318f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 353, in recv_tensors [default6]:frame #61: _PyEval_EvalFrameDefault + 0x2d83 (0x5620002ff2b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:frame #62: + 0x150582 (0x562000318582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:frame #63: PyObject_Call + 0xbc (0x562000318f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:frame #16: + 0x1a08a88 (0x7f536584fa88 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default6]:. This may indicate a possible application crash on rank 0 or a network set up issue. 
[default5]:frame #17: + 0x5849a84 (0x7f5369690a84 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default6]:frame #3: c10d::TCPStore::doGet(std::string const&) + 0x32 (0x7fc404c47ce2 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default5]:frame #53: _PyObject_Call_Prepend + 0x69 (0x55bb40da1c39 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:frame #54: + 0x211239 (0x55bb40e64239 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:frame #4: c10d::TCPStore::get(std::string const&) + 0xa1 (0x7fc404c48b11 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default6]:frame #5: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7fc404bfdf81 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default6]: return self.forward_with_hidden_states(input_ids=input_ids, input_mask=input_mask)[0] [default6]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 780, in forward_with_hidden_states [default6]:frame #6: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7fc404bfdf81 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default5]:frame #55: PyObject_Call + 0x207 (0x55bb40da4067 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:frame #53: _PyObject_Call_Prepend + 0x69 (0x55a04baa6c39 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]: return self._call_impl(*args, **kwargs) [default4]:frame #1: + 0x589518e (0x7f168d30d18e in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default4]:frame #13: + 0x5843330 (0x7f6f75d96330 in 
/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default4]:frame #14: + 0x58433c5 (0x7f6f75d963c5 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default4]:frame #2: c10d::TCPStore::doWait(c10::ArrayRef, std::chrono::duration >) + 0x360 (0x7f168d3079a0 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default3]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 764, in forward [default5]:frame #18: + 0x584ed35 (0x7f5369695d35 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default5]:frame #56: _PyEval_EvalFrameDefault + 0x2d83 (0x55bb40d8a2b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:frame #57: + 0x150582 (0x55bb40da3582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]: buffers, futures = self.irecv_tensors(num_tensors=num_tensors, from_rank=from_rank, tag=tag) [default0]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl [default0]: return forward_call(*args, **kwargs) [default3]: return self.forward_with_hidden_states(input_ids=input_ids, input_mask=input_mask)[0] [default1]:frame #46: PyObject_Call + 0xbc (0x5577f18ecf1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:frame #15: + 0x4e893cc (0x7f6f753dc3cc in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default1]:frame #47: _PyEval_EvalFrameDefault + 0x2d83 (0x5577f18d32b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/block.py", 
line 126, in forward [default4]:frame #3: c10d::TCPStore::doGet(std::string const&) + 0x32 (0x7f168d307ce2 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default6]: hidden_encoder_states = encoder_block(**hidden_encoder_states) [default5]:frame #19: + 0xc97eee (0x7f537bf47eee in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_python.so) [default7]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 326, in irecv_tensors [default2]: pg.recv([tensor], group_src_rank, tag).wait() [default4]:frame #16: + 0x1a08a88 (0x7f6f71f5ba88 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default4]:frame #17: + 0x5849a84 (0x7f6f75d9ca84 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default6]:frame #7: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7fc404bfdf81 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default3]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 780, in forward_with_hidden_states [default3]: hidden_encoder_states = encoder_block(**hidden_encoder_states) [default6]:frame #8: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7fc404bfdf81 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default3]:frame #54: + 0x211239 (0x55a04bb69239 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:frame #58: _PyEval_EvalFrameDefault + 0x13ca (0x55bb40d888fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]: new_kwargs[name] = recv_from_pipeline_state_buffer( [default0]: File 
"/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/functional.py", line 117, in recv_from_pipeline_state_buffer [default5]:frame #20: + 0x413ea4 (0x7f537b6c3ea4 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_python.so) [default4]:frame #18: + 0x584ed35 (0x7f6f75da1d35 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default0]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 326, in irecv_tensors [default3]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl [default4]:frame #4: c10d::TCPStore::get(std::string const&) + 0xa1 (0x7f168d308b11 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default3]: return self._call_impl(*args, **kwargs) [default6]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl [default4]:frame #5: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f168d2bdf81 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default5]:frame #59: + 0x150582 (0x55bb40da3582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:frame #55: PyObject_Call + 0x207 (0x55a04baa9067 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:frame #21: + 0x1445a6 (0x555acd35b5a6 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl [default6]: return self._call_impl(*args, **kwargs) [default0]: 
pipeline_state.run_communication() [default0]: meta = self._recv_meta(from_rank=from_rank, tag=tag) [default4]:frame #19: + 0xc97eee (0x7f6f88653eee in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_python.so) [default4]:frame #20: + 0x413ea4 (0x7f6f87dcfea4 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_python.so) [default0]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 246, in _recv_meta [default5]:frame #60: PyObject_Call + 0xbc (0x55bb40da3f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:frame #61: _PyEval_EvalFrameDefault + 0x2d83 (0x55bb40d8a2b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]: dist.recv( [default5]:frame #62: + 0x150582 (0x55bb40da3582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:torch.distributed.DistBackendError: [12] is setting up NCCL communicator and retrieving ncclUniqueId from [0] via c10d key-value store by key '11:12', but store->get('11:12') got error: Connection reset by peer [default0]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 150, in run_communication [default0]: recv_activation_tensor = recv_activation() [default3]: return forward_call(*args, **kwargs) [default2]:Exception raised from recvBytes at ../torch/csrc/distributed/c10d/Utils.hpp:670 (most recent call first): [default5]:frame #22: _PyObject_MakeTpCall + 0x26b (0x555acd354a6b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:frame #63: PyObject_Call + 0xbc (0x55bb40da3f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]: meta = self._recv_meta(from_rank=from_rank, tag=tag) [default5]:. 
This may indicate a possible application crash on rank 0 or a network set up issue. [default4]:frame #21: + 0x1445a6 (0x55db942fc5a6 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:frame #22: _PyObject_MakeTpCall + 0x26b (0x55db942f5a6b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 72, in wrapper [default0]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 31, in __call__ [default1]:frame #48: + 0x150582 (0x5577f18ec582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/block.py", line 126, in forward [default5]:frame #23: + 0x150866 (0x555acd367866 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]: new_kwargs[name] = recv_from_pipeline_state_buffer( [default4]:frame #23: + 0x150866 (0x55db94308866 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 246, in _recv_meta [default6]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl [default4]:frame #6: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f168d2bdf81 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default0]: return func(*args, **kwargs) [default3]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/functional.py", line 117, in recv_from_pipeline_state_buffer [default0]: File 
"/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1706, in recv [default3]:frame #56: _PyEval_EvalFrameDefault + 0x2d83 (0x55a04ba8f2b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:frame #24: _PyEval_EvalFrameDefault + 0x4c12 (0x555acd350142 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]: return forward_call(*args, **kwargs) [default2]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f2a54dfdd87 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default0]: return self.p2p.recv_tensors(num_tensors=1, from_rank=self.from_rank)[0] [default6]:frame #9: c10d::ProcessGroupNCCL::broadcastUniqueNCCLID(ncclUniqueId*, bool, std::string const&, int) + 0xa9 (0x7fc3cde3bc69 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default3]: pipeline_state.run_communication() [default7]: dist.recv( [default6]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/block.py", line 126, in forward [default4]:frame #7: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f168d2bdf81 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default0]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 353, in recv_tensors [default3]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 150, in run_communication [default3]:frame #57: + 0x150582 (0x55a04baa8582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:frame #25: _PyFunction_Vectorcall + 0x6c (0x555acd35ba2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]: 
new_kwargs[name] = recv_from_pipeline_state_buffer( [default7]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 72, in wrapper [default6]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/functional.py", line 117, in recv_from_pipeline_state_buffer [default4]:frame #24: _PyEval_EvalFrameDefault + 0x4c12 (0x55db942f1142 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:frame #10: c10d::ProcessGroupNCCL::getNCCLComm(std::string const&, std::vector > const&, c10d::OpType, int, bool) + 0x22b (0x7fc3cde42c5b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default3]: recv_activation_tensor = recv_activation() [default6]:frame #11: c10d::ProcessGroupNCCL::recv(std::vector >&, int, int) + 0x550 (0x7fc3cde65b60 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default6]: pipeline_state.run_communication() [default2]:frame #1: + 0x589518e (0x7f2a8cdb718e in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default5]:frame #26: PyObject_Call + 0xbc (0x555acd367f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]: return func(*args, **kwargs) [default0]: pg.recv([tensor], group_src_rank, tag).wait() [default3]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 31, in __call__ [default3]:frame #58: _PyEval_EvalFrameDefault + 0x13ca (0x55a04ba8d8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:frame #25: _PyFunction_Vectorcall + 0x6c (0x55db942fca2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:frame #26: PyObject_Call + 0xbc (0x55db94308f1c in 
/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:frame #2: c10d::TCPStore::doWait(c10::ArrayRef, std::chrono::duration >) + 0x360 (0x7f2a8cdb19a0 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default6]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 150, in run_communication [default2]:frame #3: c10d::TCPStore::doGet(std::string const&) + 0x32 (0x7f2a8cdb1ce2 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default0]: buffers, futures = self.irecv_tensors(num_tensors=num_tensors, from_rank=from_rank, tag=tag) [default0]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 326, in irecv_tensors [default0]: meta = self._recv_meta(from_rank=from_rank, tag=tag) [default1]:frame #49: PyObject_Call + 0xbc (0x5577f18ecf1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]: return self.p2p.recv_tensors(num_tensors=1, from_rank=self.from_rank)[0] [default1]:frame #50: _PyEval_EvalFrameDefault + 0x2d83 (0x5577f18d32b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:frame #59: + 0x150582 (0x55a04baa8582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]: recv_activation_tensor = recv_activation() [default5]:frame #27: _PyEval_EvalFrameDefault + 0x2d83 (0x555acd34e2b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 31, in __call__ [default4]:frame #27: _PyEval_EvalFrameDefault + 0x2d83 (0x55db942ef2b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:frame #28: _PyFunction_Vectorcall + 0x6c (0x555acd35ba2c in 
/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:frame #12: + 0x5838439 (0x7fc404bf0439 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default0]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 246, in _recv_meta [default7]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1706, in recv [default3]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 353, in recv_tensors [default0]:torch.distributed.DistBackendError: [12] is setting up NCCL communicator and retrieving ncclUniqueId from [0] via c10d key-value store by key '11:12', but store->get('11:12') got error: Connection reset by peer [default0]: dist.recv( [default0]:Exception raised from recvBytes at ../torch/csrc/distributed/c10d/Utils.hpp:670 (most recent call first): [default3]: buffers, futures = self.irecv_tensors(num_tensors=num_tensors, from_rank=from_rank, tag=tag) [default3]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 326, in irecv_tensors [default5]:frame #29: _PyEval_EvalFrameDefault + 0x13ca (0x555acd34c8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 72, in wrapper [default7]: pg.recv([tensor], group_src_rank, tag).wait() [default0]: return func(*args, **kwargs) [default4]:frame #28: _PyFunction_Vectorcall + 0x6c (0x55db942fca2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]: meta = self._recv_meta(from_rank=from_rank, tag=tag) [default1]:frame #51: _PyFunction_Vectorcall + 0x6c (0x5577f18e0a2c in 
/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:frame #8: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f168d2bdf81 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default0]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f6bb2b01d87 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default6]: return self.p2p.recv_tensors(num_tensors=1, from_rank=self.from_rank)[0] [default0]:frame #1: + 0x589518e (0x7f6beaabb18e in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default3]:frame #60: PyObject_Call + 0xbc (0x55a04baa8f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1706, in recv [default7]:torch.distributed.DistBackendError: [13] is setting up NCCL communicator and retrieving ncclUniqueId from [0] via c10d key-value store by key '12:13', but store->get('12:13') got error: Connection reset by peer [default6]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 353, in recv_tensors [default4]:frame #9: c10d::ProcessGroupNCCL::broadcastUniqueNCCLID(ncclUniqueId*, bool, std::string const&, int) + 0xa9 (0x7f16564fbc69 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default3]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 246, in _recv_meta [default6]:frame #13: + 0x5843330 (0x7fc404bfb330 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default0]: pg.recv([tensor], group_src_rank, tag).wait() [default6]:frame 
#14: + 0x58433c5 (0x7fc404bfb3c5 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default6]: buffers, futures = self.irecv_tensors(num_tensors=num_tensors, from_rank=from_rank, tag=tag) [default6]:frame #15: + 0x4e893cc (0x7fc4042413cc in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default3]: dist.recv( [default2]:frame #4: c10d::TCPStore::get(std::string const&) + 0xa1 (0x7f2a8cdb2b11 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default4]:frame #10: c10d::ProcessGroupNCCL::getNCCLComm(std::string const&, std::vector > const&, c10d::OpType, int, bool) + 0x22b (0x7f1656502c5b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default0]:torch.distributed.DistBackendError: [8] is setting up NCCL communicator and retrieving ncclUniqueId from [0] via c10d key-value store by key '7:8', but store->get('7:8') got error: Connection reset by peer [default4]:frame #11: c10d::ProcessGroupNCCL::recv(std::vector >&, int, int) + 0x550 (0x7f1656525b60 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default3]:frame #61: _PyEval_EvalFrameDefault + 0x2d83 (0x55a04ba8f2b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:frame #62: + 0x150582 (0x55a04baa8582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 326, in irecv_tensors [default6]:frame #16: + 0x1a08a88 (0x7fc400dc0a88 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default3]: File 
"/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 72, in wrapper [default0]:Exception raised from recvBytes at ../torch/csrc/distributed/c10d/Utils.hpp:670 (most recent call first): [default6]:frame #17: + 0x5849a84 (0x7fc404c01a84 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default6]: meta = self._recv_meta(from_rank=from_rank, tag=tag) [default6]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 246, in _recv_meta [default7]:Exception raised from recvBytes at ../torch/csrc/distributed/c10d/Utils.hpp:670 (most recent call first): [default0]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7ff876e43d87 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default0]:frame #2: c10d::TCPStore::doWait(c10::ArrayRef, std::chrono::duration >) + 0x360 (0x7f6beaab59a0 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default0]:frame #1: + 0x589518e (0x7ff8aedfd18e in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default2]:frame #5: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f2a8cd67f81 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default4]:frame #29: _PyEval_EvalFrameDefault + 0x13ca (0x55db942ed8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:frame #30: + 0x150582 (0x555acd367582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]: return func(*args, **kwargs) [default5]:frame #31: _PyEval_EvalFrameDefault + 0x13ca (0x555acd34c8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:frame #30: + 0x150582 
(0x55db94308582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:frame #3: c10d::TCPStore::doGet(std::string const&) + 0x32 (0x7f6beaab5ce2 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default4]:frame #12: + 0x5838439 (0x7f168d2b0439 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default6]:frame #18: + 0x584ed35 (0x7fc404c06d35 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default3]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1706, in recv [default0]:frame #2: c10d::TCPStore::doWait(c10::ArrayRef, std::chrono::duration >) + 0x360 (0x7ff8aedf79a0 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default3]:frame #63: PyObject_Call + 0xbc (0x55a04baa8f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]: dist.recv( [default3]:. This may indicate a possible application crash on rank 0 or a network set up issue. 
[default6]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 72, in wrapper [default4]:frame #31: _PyEval_EvalFrameDefault + 0x13ca (0x55db942ed8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:frame #3: c10d::TCPStore::doGet(std::string const&) + 0x32 (0x7ff8aedf7ce2 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default0]:frame #4: c10d::TCPStore::get(std::string const&) + 0xa1 (0x7f6beaab6b11 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default3]: pg.recv([tensor], group_src_rank, tag).wait() [default6]:frame #19: + 0xc97eee (0x7fc4174b8eee in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_python.so) [default0]:frame #4: c10d::TCPStore::get(std::string const&) + 0xa1 (0x7ff8aedf8b11 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default6]:frame #20: + 0x413ea4 (0x7fc416c34ea4 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_python.so) [default6]: return func(*args, **kwargs) [default5]:frame #32: + 0x150582 (0x555acd367582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1706, in recv [default5]:frame #33: _PyEval_EvalFrameDefault + 0x13ca (0x555acd34c8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:frame #5: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7ff8aedadf81 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default1]:frame #52: _PyObject_FastCallDictTstate + 0x187 
(0x5577f18d9007 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:frame #21: + 0x1445a6 (0x56520775c5a6 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:torch.distributed.DistBackendError: [8] is setting up NCCL communicator and retrieving ncclUniqueId from [0] via c10d key-value store by key '7:8', but store->get('7:8') got error: Connection reset by peer [default0]:frame #5: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f6beaa6bf81 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default0]:frame #6: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f6beaa6bf81 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default3]:Exception raised from recvBytes at ../torch/csrc/distributed/c10d/Utils.hpp:670 (most recent call first): [default7]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7fddac069d87 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default4]:frame #32: + 0x150582 (0x55db94308582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:frame #22: _PyObject_MakeTpCall + 0x26b (0x565207755a6b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:frame #23: + 0x150866 (0x565207768866 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]: pg.recv([tensor], group_src_rank, tag).wait() [default6]:torch.distributed.DistBackendError: [9] is setting up NCCL communicator and retrieving ncclUniqueId from [0] via c10d key-value store by key '8:9', but store->get('8:9') got error: Connection reset by peer [default4]:frame #13: + 0x5843330 (0x7f168d2bb330 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default4]:frame #14: + 0x58433c5 
(0x7f168d2bb3c5 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default4]:frame #33: _PyEval_EvalFrameDefault + 0x13ca (0x55db942ed8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:frame #53: _PyObject_Call_Prepend + 0x69 (0x5577f18eac39 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:frame #6: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7ff8aedadf81 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default1]:frame #54: + 0x211239 (0x5577f19ad239 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:frame #7: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7ff8aedadf81 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default0]:frame #7: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f6beaa6bf81 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default6]:frame #24: _PyEval_EvalFrameDefault + 0x4c12 (0x565207751142 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:frame #34: + 0x150582 (0x55db94308582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:frame #6: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f2a8cd67f81 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default2]:frame #7: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f2a8cd67f81 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default3]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f3234954d87 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) 
[default6]:Exception raised from recvBytes at ../torch/csrc/distributed/c10d/Utils.hpp:670 (most recent call first): [default1]:frame #55: PyObject_Call + 0x207 (0x5577f18ed067 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:frame #8: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7ff8aedadf81 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default6]:frame #25: _PyFunction_Vectorcall + 0x6c (0x56520775ca2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:frame #26: PyObject_Call + 0xbc (0x565207768f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:frame #9: c10d::ProcessGroupNCCL::broadcastUniqueNCCLID(ncclUniqueId*, bool, std::string const&, int) + 0xa9 (0x7ff877febc69 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default4]:frame #35: _PyEval_EvalFrameDefault + 0x13ca (0x55db942ed8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:frame #15: + 0x4e893cc (0x7f168c9013cc in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default4]:frame #36: _PyObject_FastCallDictTstate + 0xd0 (0x55db942f4f50 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:frame #56: _PyEval_EvalFrameDefault + 0x2d83 (0x5577f18d32b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:frame #1: + 0x589518e (0x7fdde402318e in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default0]:frame #10: c10d::ProcessGroupNCCL::getNCCLComm(std::string const&, std::vector > const&, c10d::OpType, int, bool) + 0x22b (0x7ff877ff2c5b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default7]:frame #2: 
c10d::TCPStore::doWait(c10::ArrayRef, std::chrono::duration >) + 0x360 (0x7fdde401d9a0 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default2]:frame #8: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f2a8cd67f81 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default2]:frame #9: c10d::ProcessGroupNCCL::broadcastUniqueNCCLID(ncclUniqueId*, bool, std::string const&, int) + 0xa9 (0x7f2a55fa5c69 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default3]:frame #1: + 0x589518e (0x7f326c90e18e in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default5]:frame #34: + 0x150582 (0x555acd367582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:frame #37: _PyObject_Call_Prepend + 0x69 (0x55db94306c39 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:frame #3: c10d::TCPStore::doGet(std::string const&) + 0x32 (0x7fdde401dce2 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default6]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7ff3014f2d87 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default2]:frame #10: c10d::ProcessGroupNCCL::getNCCLComm(std::string const&, std::vector > const&, c10d::OpType, int, bool) + 0x22b (0x7f2a55facc5b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default6]:frame #1: + 0x589518e (0x7ff3394ac18e in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default5]:frame #35: _PyEval_EvalFrameDefault + 0x13ca (0x555acd34c8fa in 
/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:frame #38: + 0x211239 (0x55db943c9239 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:frame #16: + 0x1a08a88 (0x7f1689480a88 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default6]:frame #2: c10d::TCPStore::doWait(c10::ArrayRef, std::chrono::duration >) + 0x360 (0x7ff3394a69a0 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default1]:frame #57: + 0x150582 (0x5577f18ec582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:frame #39: _PyObject_MakeTpCall + 0x26b (0x55db942f5a6b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:frame #27: _PyEval_EvalFrameDefault + 0x2d83 (0x56520774f2b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:frame #8: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f6beaa6bf81 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default0]:frame #11: c10d::ProcessGroupNCCL::recv(std::vector >&, int, int) + 0x550 (0x7ff878015b60 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default0]:frame #9: c10d::ProcessGroupNCCL::broadcastUniqueNCCLID(ncclUniqueId*, bool, std::string const&, int) + 0xa9 (0x7f6bb3ca9c69 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default7]:frame #4: c10d::TCPStore::get(std::string const&) + 0xa1 (0x7fdde401eb11 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default0]:frame #12: + 0x5838439 (0x7ff8aeda0439 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 
[default2]:frame #11: c10d::ProcessGroupNCCL::recv(std::vector >&, int, int) + 0x550 (0x7f2a55fcfb60 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default4]:frame #40: _PyEval_EvalFrameDefault + 0x4eb6 (0x55db942f13e6 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:frame #28: _PyFunction_Vectorcall + 0x6c (0x56520775ca2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:frame #36: _PyObject_FastCallDictTstate + 0xd0 (0x555acd353f50 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:frame #2: c10d::TCPStore::doWait(c10::ArrayRef, std::chrono::duration >) + 0x360 (0x7f326c9089a0 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default7]:frame #5: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7fdde3fd3f81 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default3]:frame #3: c10d::TCPStore::doGet(std::string const&) + 0x32 (0x7f326c908ce2 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default7]:frame #6: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7fdde3fd3f81 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default4]:frame #41: _PyFunction_Vectorcall + 0x6c (0x55db942fca2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:frame #7: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7fdde3fd3f81 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default6]:frame #3: c10d::TCPStore::doGet(std::string const&) + 0x32 (0x7ff3394a6ce2 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 
[default6]:frame #29: _PyEval_EvalFrameDefault + 0x13ca (0x56520774d8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:frame #4: c10d::TCPStore::get(std::string const&) + 0xa1 (0x7ff3394a7b11 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default2]:frame #12: + 0x5838439 (0x7f2a8cd5a439 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default4]:frame #17: + 0x5849a84 (0x7f168d2c1a84 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default4]:frame #42: _PyEval_EvalFrameDefault + 0x72c (0x55db942ecc5c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:frame #8: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7fdde3fd3f81 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default0]:frame #13: + 0x5843330 (0x7ff8aedab330 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default0]:frame #10: c10d::ProcessGroupNCCL::getNCCLComm(std::string const&, std::vector > const&, c10d::OpType, int, bool) + 0x22b (0x7f6bb3cb0c5b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default6]:frame #5: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7ff33945cf81 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default2]:frame #13: + 0x5843330 (0x7f2a8cd65330 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default3]:frame #4: c10d::TCPStore::get(std::string const&) + 0xa1 (0x7f326c909b11 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default0]:frame #14: + 
0x58433c5 (0x7ff8aedab3c5 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default1]:frame #58: _PyEval_EvalFrameDefault + 0x13ca (0x5577f18d18fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:frame #6: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7ff33945cf81 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default6]:frame #30: + 0x150582 (0x565207768582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:frame #5: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f326c8bef81 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default2]:frame #14: + 0x58433c5 (0x7f2a8cd653c5 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default3]:frame #6: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f326c8bef81 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default4]:frame #43: _PyFunction_Vectorcall + 0x6c (0x55db942fca2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:frame #59: + 0x150582 (0x5577f18ec582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:frame #18: + 0x584ed35 (0x7f168d2c6d35 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default6]:frame #7: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7ff33945cf81 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default6]:frame #31: _PyEval_EvalFrameDefault + 0x13ca (0x56520774d8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:frame #9: c10d::ProcessGroupNCCL::broadcastUniqueNCCLID(ncclUniqueId*, 
bool, std::string const&, int) + 0xa9 (0x7fddad211c69 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default3]:frame #7: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f326c8bef81 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default2]:frame #15: + 0x4e893cc (0x7f2a8c3ab3cc in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default6]:frame #8: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7ff33945cf81 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default5]:frame #37: _PyObject_Call_Prepend + 0x69 (0x555acd365c39 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:frame #9: c10d::ProcessGroupNCCL::broadcastUniqueNCCLID(ncclUniqueId*, bool, std::string const&, int) + 0xa9 (0x7ff30269ac69 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default6]:frame #32: + 0x150582 (0x565207768582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:frame #10: c10d::ProcessGroupNCCL::getNCCLComm(std::string const&, std::vector > const&, c10d::OpType, int, bool) + 0x22b (0x7ff3026a1c5b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default4]:frame #44: _PyEval_EvalFrameDefault + 0x13ca (0x55db942ed8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:frame #8: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f326c8bef81 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default1]:frame #60: PyObject_Call + 0xbc (0x5577f18ecf1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:frame #11: 
c10d::ProcessGroupNCCL::recv(std::vector >&, int, int) + 0x550 (0x7ff3026c4b60 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default4]:frame #19: + 0xc97eee (0x7f169fb78eee in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_python.so) [default0]:frame #11: c10d::ProcessGroupNCCL::recv(std::vector >&, int, int) + 0x550 (0x7f6bb3cd3b60 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default6]:frame #12: + 0x5838439 (0x7ff33944f439 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default6]:frame #33: _PyEval_EvalFrameDefault + 0x13ca (0x56520774d8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:frame #10: c10d::ProcessGroupNCCL::getNCCLComm(std::string const&, std::vector > const&, c10d::OpType, int, bool) + 0x22b (0x7fddad218c5b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default4]:frame #45: + 0x150582 (0x55db94308582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:frame #20: + 0x413ea4 (0x7f169f2f4ea4 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_python.so) [default3]:frame #9: c10d::ProcessGroupNCCL::broadcastUniqueNCCLID(ncclUniqueId*, bool, std::string const&, int) + 0xa9 (0x7f3235afcc69 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default2]:frame #16: + 0x1a08a88 (0x7f2a88f2aa88 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default3]:frame #10: c10d::ProcessGroupNCCL::getNCCLComm(std::string const&, std::vector > const&, c10d::OpType, int, bool) + 0x22b (0x7f3235b03c5b in 
/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default0]:frame #15: + 0x4e893cc (0x7ff8ae3f13cc in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default6]:frame #34: + 0x150582 (0x565207768582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:frame #13: + 0x5843330 (0x7ff33945a330 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default5]:frame #38: + 0x211239 (0x555acd428239 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:frame #46: PyObject_Call + 0xbc (0x55db94308f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:frame #61: _PyEval_EvalFrameDefault + 0x2d83 (0x5577f18d32b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:frame #11: c10d::ProcessGroupNCCL::recv(std::vector >&, int, int) + 0x550 (0x7f3235b26b60 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default7]:frame #11: c10d::ProcessGroupNCCL::recv(std::vector >&, int, int) + 0x550 (0x7fddad23bb60 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default4]:frame #47: _PyEval_EvalFrameDefault + 0x2d83 (0x55db942ef2b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:frame #35: _PyEval_EvalFrameDefault + 0x13ca (0x56520774d8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:frame #14: + 0x58433c5 (0x7ff33945a3c5 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default5]:frame #39: _PyObject_MakeTpCall + 0x26b (0x555acd354a6b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:frame #15: + 0x4e893cc 
(0x7ff338aa03cc in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default6]:frame #16: + 0x1a08a88 (0x7ff33561fa88 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default0]:frame #12: + 0x5838439 (0x7f6beaa5e439 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default3]:frame #12: + 0x5838439 (0x7f326c8b1439 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default4]:frame #21: + 0x1445a6 (0x55b80453b5a6 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:frame #16: + 0x1a08a88 (0x7ff8aaf70a88 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default6]:frame #17: + 0x5849a84 (0x7ff339460a84 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default6]:frame #36: _PyObject_FastCallDictTstate + 0xd0 (0x565207754f50 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:frame #40: _PyEval_EvalFrameDefault + 0x4eb6 (0x555acd3503e6 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:frame #13: + 0x5843330 (0x7f326c8bc330 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default7]:frame #12: + 0x5838439 (0x7fdde3fc6439 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default4]:frame #48: + 0x150582 (0x55db94308582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:frame #13: + 0x5843330 (0x7f6beaa69330 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default4]:frame #49: PyObject_Call + 0xbc 
(0x55db94308f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:frame #14: + 0x58433c5 (0x7f6beaa693c5 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default0]:frame #17: + 0x5849a84 (0x7ff8aedb1a84 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default0]:frame #15: + 0x4e893cc (0x7f6bea0af3cc in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default4]:frame #50: _PyEval_EvalFrameDefault + 0x2d83 (0x55db942ef2b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:frame #17: + 0x5849a84 (0x7f2a8cd6ba84 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default4]:frame #22: _PyObject_MakeTpCall + 0x26b (0x55b804534a6b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:frame #51: _PyFunction_Vectorcall + 0x6c (0x55db942fca2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:frame #41: _PyFunction_Vectorcall + 0x6c (0x555acd35ba2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:frame #14: + 0x58433c5 (0x7f326c8bc3c5 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default7]:frame #13: + 0x5843330 (0x7fdde3fd1330 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default4]:frame #52: _PyObject_FastCallDictTstate + 0x187 (0x55db942f5007 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:frame #62: + 0x150582 (0x5577f18ec582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:frame #18: + 0x584ed35 (0x7ff339465d35 in 
/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default0]:frame #16: + 0x1a08a88 (0x7f6be6c2ea88 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default0]:frame #18: + 0x584ed35 (0x7ff8aedb6d35 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default0]:frame #17: + 0x5849a84 (0x7f6beaa6fa84 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default3]:frame #15: + 0x4e893cc (0x7f326bf023cc in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default5]:frame #42: _PyEval_EvalFrameDefault + 0x72c (0x555acd34bc5c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:frame #16: + 0x1a08a88 (0x7f3268a81a88 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default4]:frame #53: _PyObject_Call_Prepend + 0x69 (0x55db94306c39 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default1]:frame #63: PyObject_Call + 0xbc (0x5577f18ecf1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:frame #19: + 0xc97eee (0x7ff8c1668eee in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_python.so) [default1]:. This may indicate a possible application crash on rank 0 or a network set up issue. 
[default4]:frame #23: + 0x150866 (0x55b804547866 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:frame #20: + 0x413ea4 (0x7ff8c0de4ea4 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_python.so) [default0]:frame #18: + 0x584ed35 (0x7f6beaa74d35 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default4]:frame #54: + 0x211239 (0x55db943c9239 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:frame #37: _PyObject_Call_Prepend + 0x69 (0x565207766c39 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:frame #17: + 0x5849a84 (0x7f326c8c2a84 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default7]:frame #14: + 0x58433c5 (0x7fdde3fd13c5 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default4]:frame #55: PyObject_Call + 0x207 (0x55db94309067 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:frame #15: + 0x4e893cc (0x7fdde36173cc in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default4]:frame #56: _PyEval_EvalFrameDefault + 0x2d83 (0x55db942ef2b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:frame #16: + 0x1a08a88 (0x7fdde0196a88 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default4]:frame #57: + 0x150582 (0x55db94308582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:frame #43: _PyFunction_Vectorcall + 0x6c (0x555acd35ba2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:frame #58: _PyEval_EvalFrameDefault + 0x13ca (0x55db942ed8fa in 
/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:frame #44: _PyEval_EvalFrameDefault + 0x13ca (0x555acd34c8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:frame #59: + 0x150582 (0x55db94308582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:frame #38: + 0x211239 (0x565207829239 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:frame #21: + 0x1445a6 (0x55f1e97e75a6 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:frame #17: + 0x5849a84 (0x7fdde3fd7a84 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default6]:frame #19: + 0xc97eee (0x7ff34bd17eee in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_python.so) [default4]:frame #60: PyObject_Call + 0xbc (0x55db94308f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:frame #19: + 0xc97eee (0x7f6bfd326eee in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_python.so) [default4]:frame #61: _PyEval_EvalFrameDefault + 0x2d83 (0x55db942ef2b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:frame #24: _PyEval_EvalFrameDefault + 0x4c12 (0x55b804530142 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:frame #20: + 0x413ea4 (0x7ff34b493ea4 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_python.so) [default3]:frame #18: + 0x584ed35 (0x7f326c8c7d35 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default6]:frame #39: _PyObject_MakeTpCall + 0x26b (0x565207755a6b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:frame #40: _PyEval_EvalFrameDefault + 0x4eb6 
(0x5652077513e6 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:frame #19: + 0xc97eee (0x7f327f179eee in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_python.so) [default0]:frame #20: + 0x413ea4 (0x7f6bfcaa2ea4 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_python.so) [default3]:frame #20: + 0x413ea4 (0x7f327e8f5ea4 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_python.so) [default5]:frame #45: + 0x150582 (0x555acd367582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:frame #18: + 0x584ed35 (0x7fdde3fdcd35 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default0]:frame #22: _PyObject_MakeTpCall + 0x26b (0x55f1e97e0a6b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:frame #19: + 0xc97eee (0x7fddf688eeee in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_python.so) [default6]:frame #21: + 0x1445a6 (0x561787cd55a6 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:frame #21: + 0x1445a6 (0x557b4d73d5a6 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:frame #21: + 0x1445a6 (0x55b4a6adf5a6 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:frame #25: _PyFunction_Vectorcall + 0x6c (0x55b80453ba2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:frame #22: _PyObject_MakeTpCall + 0x26b (0x55b4a6ad8a6b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:frame #62: + 0x150582 (0x55db94308582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:frame #26: PyObject_Call + 0xbc (0x55b804547f1c in 
/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:frame #22: _PyObject_MakeTpCall + 0x26b (0x561787ccea6b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:frame #41: _PyFunction_Vectorcall + 0x6c (0x56520775ca2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:frame #23: + 0x150866 (0x561787ce1866 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:frame #46: PyObject_Call + 0xbc (0x555acd367f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:frame #63: PyObject_Call + 0xbc (0x55db94308f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:frame #27: _PyEval_EvalFrameDefault + 0x2d83 (0x55b80452e2b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:frame #22: _PyObject_MakeTpCall + 0x26b (0x557b4d736a6b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:frame #23: + 0x150866 (0x55b4a6aeb866 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:frame #24: _PyEval_EvalFrameDefault + 0x4c12 (0x561787cca142 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:frame #42: _PyEval_EvalFrameDefault + 0x72c (0x56520774cc5c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:frame #18: + 0x584ed35 (0x7f2a8cd70d35 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default4]:. This may indicate a possible application crash on rank 0 or a network set up issue. 
[default4]:frame #28: _PyFunction_Vectorcall + 0x6c (0x55b80453ba2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:frame #23: + 0x150866 (0x55f1e97f3866 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:frame #43: _PyFunction_Vectorcall + 0x6c (0x56520775ca2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:frame #24: _PyEval_EvalFrameDefault + 0x4c12 (0x55f1e97dc142 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:frame #25: _PyFunction_Vectorcall + 0x6c (0x55f1e97e7a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:frame #23: + 0x150866 (0x557b4d749866 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:frame #24: _PyEval_EvalFrameDefault + 0x4c12 (0x557b4d732142 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:frame #24: _PyEval_EvalFrameDefault + 0x4c12 (0x55b4a6ad4142 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:frame #19: + 0xc97eee (0x7f2a9f622eee in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_python.so) [default3]:frame #25: _PyFunction_Vectorcall + 0x6c (0x55b4a6adfa2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:frame #29: _PyEval_EvalFrameDefault + 0x13ca (0x55b80452c8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:frame #47: _PyEval_EvalFrameDefault + 0x2d83 (0x555acd34e2b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:frame #26: PyObject_Call + 0xbc (0x55f1e97f3f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:frame #20: + 0x413ea4 (0x7fddf600aea4 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_python.so) [default3]:frame #26: 
PyObject_Call + 0xbc (0x55b4a6aebf1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:frame #20: + 0x413ea4 (0x7f2a9ed9eea4 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_python.so) [default0]:frame #27: _PyEval_EvalFrameDefault + 0x2d83 (0x55f1e97da2b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:frame #21: + 0x1445a6 (0x5590e7c845a6 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:frame #25: _PyFunction_Vectorcall + 0x6c (0x561787cd5a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:frame #27: _PyEval_EvalFrameDefault + 0x2d83 (0x55b4a6ad22b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:frame #48: + 0x150582 (0x555acd367582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:frame #28: _PyFunction_Vectorcall + 0x6c (0x55f1e97e7a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:frame #21: + 0x1445a6 (0x55f4f431a5a6 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:frame #26: PyObject_Call + 0xbc (0x561787ce1f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:frame #44: _PyEval_EvalFrameDefault + 0x13ca (0x56520774d8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:frame #28: _PyFunction_Vectorcall + 0x6c (0x55b4a6adfa2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:frame #22: _PyObject_MakeTpCall + 0x26b (0x5590e7c7da6b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:frame #29: _PyEval_EvalFrameDefault + 0x13ca (0x55f1e97d88fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:frame #22: _PyObject_MakeTpCall + 0x26b (0x55f4f4313a6b in 
/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:frame #30: + 0x150582 (0x55f1e97f3582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:frame #29: _PyEval_EvalFrameDefault + 0x13ca (0x55b4a6ad08fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:frame #30: + 0x150582 (0x55b4a6aeb582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:frame #45: + 0x150582 (0x565207768582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:frame #27: _PyEval_EvalFrameDefault + 0x2d83 (0x561787cc82b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:frame #23: + 0x150866 (0x5590e7c90866 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:frame #31: _PyEval_EvalFrameDefault + 0x13ca (0x55f1e97d88fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:frame #30: + 0x150582 (0x55b804547582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:frame #31: _PyEval_EvalFrameDefault + 0x13ca (0x55b80452c8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:frame #28: _PyFunction_Vectorcall + 0x6c (0x561787cd5a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:frame #46: PyObject_Call + 0xbc (0x565207768f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:frame #47: _PyEval_EvalFrameDefault + 0x2d83 (0x56520774f2b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:frame #31: _PyEval_EvalFrameDefault + 0x13ca (0x55b4a6ad08fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:frame #25: _PyFunction_Vectorcall + 0x6c (0x557b4d73da2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:frame #48: + 0x150582 (0x565207768582 in 
/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:frame #32: + 0x150582 (0x55b4a6aeb582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:frame #29: _PyEval_EvalFrameDefault + 0x13ca (0x561787cc68fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:frame #32: + 0x150582 (0x55f1e97f3582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:frame #49: PyObject_Call + 0xbc (0x565207768f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:frame #23: + 0x150866 (0x55f4f4326866 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:frame #49: PyObject_Call + 0xbc (0x555acd367f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:frame #33: _PyEval_EvalFrameDefault + 0x13ca (0x55b4a6ad08fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:frame #26: PyObject_Call + 0xbc (0x557b4d749f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:frame #24: _PyEval_EvalFrameDefault + 0x4c12 (0x5590e7c79142 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:frame #33: _PyEval_EvalFrameDefault + 0x13ca (0x55f1e97d88fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:frame #32: + 0x150582 (0x55b804547582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:frame #50: _PyEval_EvalFrameDefault + 0x2d83 (0x555acd34e2b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:frame #34: + 0x150582 (0x55f1e97f3582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:frame #34: + 0x150582 (0x55b4a6aeb582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:frame #30: + 0x150582 (0x561787ce1582 in 
/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:frame #51: _PyFunction_Vectorcall + 0x6c (0x555acd35ba2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:frame #35: _PyEval_EvalFrameDefault + 0x13ca (0x55b4a6ad08fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:frame #24: _PyEval_EvalFrameDefault + 0x4c12 (0x55f4f430f142 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:frame #31: _PyEval_EvalFrameDefault + 0x13ca (0x561787cc68fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:frame #35: _PyEval_EvalFrameDefault + 0x13ca (0x55f1e97d88fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:frame #27: _PyEval_EvalFrameDefault + 0x2d83 (0x557b4d7302b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:frame #32: + 0x150582 (0x561787ce1582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:frame #33: _PyEval_EvalFrameDefault + 0x13ca (0x55b80452c8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:frame #36: _PyObject_FastCallDictTstate + 0xd0 (0x55f1e97dff50 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:frame #25: _PyFunction_Vectorcall + 0x6c (0x55f4f431aa2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:frame #33: _PyEval_EvalFrameDefault + 0x13ca (0x561787cc68fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:frame #50: _PyEval_EvalFrameDefault + 0x2d83 (0x56520774f2b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:frame #25: _PyFunction_Vectorcall + 0x6c (0x5590e7c84a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:frame #37: _PyObject_Call_Prepend + 0x69 (0x55f1e97f1c39 in 
/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:frame #38: + 0x211239 (0x55f1e98b4239 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:frame #36: _PyObject_FastCallDictTstate + 0xd0 (0x55b4a6ad7f50 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:frame #34: + 0x150582 (0x55b804547582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:frame #34: + 0x150582 (0x561787ce1582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:frame #28: _PyFunction_Vectorcall + 0x6c (0x557b4d73da2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:frame #29: _PyEval_EvalFrameDefault + 0x13ca (0x557b4d72e8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:frame #39: _PyObject_MakeTpCall + 0x26b (0x55f1e97e0a6b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:frame #26: PyObject_Call + 0xbc (0x5590e7c90f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:frame #37: _PyObject_Call_Prepend + 0x69 (0x55b4a6ae9c39 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:frame #35: _PyEval_EvalFrameDefault + 0x13ca (0x55b80452c8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:frame #36: _PyObject_FastCallDictTstate + 0xd0 (0x55b804533f50 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:frame #35: _PyEval_EvalFrameDefault + 0x13ca (0x561787cc68fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:frame #30: + 0x150582 (0x557b4d749582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:frame #52: _PyObject_FastCallDictTstate + 0x187 (0x555acd354007 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:frame #40: 
_PyEval_EvalFrameDefault + 0x4eb6 (0x55f1e97dc3e6 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:frame #26: PyObject_Call + 0xbc (0x55f4f4326f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:frame #41: _PyFunction_Vectorcall + 0x6c (0x55f1e97e7a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:frame #36: _PyObject_FastCallDictTstate + 0xd0 (0x561787ccdf50 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:frame #38: + 0x211239 (0x55b4a6bac239 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:frame #37: _PyObject_Call_Prepend + 0x69 (0x55b804545c39 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:frame #37: _PyObject_Call_Prepend + 0x69 (0x561787cdfc39 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:frame #38: + 0x211239 (0x561787da2239 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:frame #38: + 0x211239 (0x55b804608239 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:frame #42: _PyEval_EvalFrameDefault + 0x72c (0x55f1e97d7c5c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:frame #43: _PyFunction_Vectorcall + 0x6c (0x55f1e97e7a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:frame #39: _PyObject_MakeTpCall + 0x26b (0x55b804534a6b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:frame #39: _PyObject_MakeTpCall + 0x26b (0x55b4a6ad8a6b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:frame #27: _PyEval_EvalFrameDefault + 0x2d83 (0x5590e7c772b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:frame #44: _PyEval_EvalFrameDefault + 0x13ca (0x55f1e97d88fa in 
/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:frame #28: _PyFunction_Vectorcall + 0x6c (0x5590e7c84a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:frame #45: + 0x150582 (0x55f1e97f3582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:frame #39: _PyObject_MakeTpCall + 0x26b (0x561787ccea6b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:frame #40: _PyEval_EvalFrameDefault + 0x4eb6 (0x55b4a6ad43e6 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:frame #51: _PyFunction_Vectorcall + 0x6c (0x56520775ca2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:frame #40: _PyEval_EvalFrameDefault + 0x4eb6 (0x561787cca3e6 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:frame #52: _PyObject_FastCallDictTstate + 0x187 (0x565207755007 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:frame #41: _PyFunction_Vectorcall + 0x6c (0x55b4a6adfa2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:frame #31: _PyEval_EvalFrameDefault + 0x13ca (0x557b4d72e8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:frame #32: + 0x150582 (0x557b4d749582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:frame #42: _PyEval_EvalFrameDefault + 0x72c (0x55b4a6acfc5c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:frame #53: _PyObject_Call_Prepend + 0x69 (0x565207766c39 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:frame #41: _PyFunction_Vectorcall + 0x6c (0x561787cd5a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:frame #40: _PyEval_EvalFrameDefault + 0x4eb6 (0x55b8045303e6 in 
/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:frame #53: _PyObject_Call_Prepend + 0x69 (0x555acd365c39 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:frame #43: _PyFunction_Vectorcall + 0x6c (0x55b4a6adfa2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:frame #54: + 0x211239 (0x555acd428239 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:frame #42: _PyEval_EvalFrameDefault + 0x72c (0x561787cc5c5c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:frame #55: PyObject_Call + 0x207 (0x555acd368067 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:frame #54: + 0x211239 (0x565207829239 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:frame #46: PyObject_Call + 0xbc (0x55f1e97f3f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:frame #55: PyObject_Call + 0x207 (0x565207769067 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:frame #43: _PyFunction_Vectorcall + 0x6c (0x561787cd5a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:frame #41: _PyFunction_Vectorcall + 0x6c (0x55b80453ba2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:frame #44: _PyEval_EvalFrameDefault + 0x13ca (0x55b4a6ad08fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:frame #56: _PyEval_EvalFrameDefault + 0x2d83 (0x555acd34e2b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:frame #47: _PyEval_EvalFrameDefault + 0x2d83 (0x55f1e97da2b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:frame #48: + 0x150582 (0x55f1e97f3582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:frame #56: _PyEval_EvalFrameDefault + 
0x2d83 (0x56520774f2b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:frame #29: _PyEval_EvalFrameDefault + 0x13ca (0x5590e7c758fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:frame #44: _PyEval_EvalFrameDefault + 0x13ca (0x561787cc68fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:frame #30: + 0x150582 (0x5590e7c90582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:frame #45: + 0x150582 (0x55b4a6aeb582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:frame #31: _PyEval_EvalFrameDefault + 0x13ca (0x5590e7c758fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:frame #49: PyObject_Call + 0xbc (0x55f1e97f3f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:frame #57: + 0x150582 (0x565207768582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:frame #58: _PyEval_EvalFrameDefault + 0x13ca (0x56520774d8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:frame #50: _PyEval_EvalFrameDefault + 0x2d83 (0x55f1e97da2b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:frame #57: + 0x150582 (0x555acd367582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:frame #46: PyObject_Call + 0xbc (0x55b4a6aebf1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:Traceback (most recent call last): [default0]:frame #33: _PyEval_EvalFrameDefault + 0x13ca (0x557b4d72e8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 237, in [default7]:frame #27: _PyEval_EvalFrameDefault + 0x2d83 (0x55f4f430d2b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:frame 
#28: _PyFunction_Vectorcall + 0x6c (0x55f4f431aa2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:frame #47: _PyEval_EvalFrameDefault + 0x2d83 (0x55b4a6ad22b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:frame #29: _PyEval_EvalFrameDefault + 0x13ca (0x55f4f430b8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:frame #48: + 0x150582 (0x55b4a6aeb582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:frame #42: _PyEval_EvalFrameDefault + 0x72c (0x55b80452bc5c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:frame #32: + 0x150582 (0x5590e7c90582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:frame #33: _PyEval_EvalFrameDefault + 0x13ca (0x5590e7c758fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:frame #45: + 0x150582 (0x561787ce1582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:frame #59: + 0x150582 (0x565207768582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]: trainer.train(dataloader) [default6]:frame #60: PyObject_Call + 0xbc (0x565207768f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:frame #58: _PyEval_EvalFrameDefault + 0x13ca (0x555acd34c8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:frame #49: PyObject_Call + 0xbc (0x55b4a6aebf1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:frame #30: + 0x150582 (0x55f4f4326582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:frame #50: _PyEval_EvalFrameDefault + 0x2d83 (0x55b4a6ad22b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:frame #31: _PyEval_EvalFrameDefault + 0x13ca (0x55f4f430b8fa in 
/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:frame #51: _PyFunction_Vectorcall + 0x6c (0x55b4a6adfa2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 430, in train [default4]:frame #43: _PyFunction_Vectorcall + 0x6c (0x55b80453ba2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:frame #44: _PyEval_EvalFrameDefault + 0x13ca (0x55b80452c8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]: outputs, loss_avg = self.training_step(dataloader=self.current_dataloader) [default6]:frame #61: _PyEval_EvalFrameDefault + 0x2d83 (0x56520774f2b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:frame #46: PyObject_Call + 0xbc (0x561787ce1f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:frame #34: + 0x150582 (0x5590e7c90582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:frame #52: _PyObject_FastCallDictTstate + 0x187 (0x55b4a6ad8007 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:frame #32: + 0x150582 (0x55f4f4326582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 459, in training_step [default6]:frame #62: + 0x150582 (0x565207768582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:frame #47: _PyEval_EvalFrameDefault + 0x2d83 (0x561787cc82b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:frame #48: + 0x150582 (0x561787ce1582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:frame #35: _PyEval_EvalFrameDefault + 0x13ca (0x5590e7c758fa in 
/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:frame #49: PyObject_Call + 0xbc (0x561787ce1f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:frame #53: _PyObject_Call_Prepend + 0x69 (0x55b4a6ae9c39 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:frame #36: _PyObject_FastCallDictTstate + 0xd0 (0x5590e7c7cf50 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:frame #54: + 0x211239 (0x55b4a6bac239 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]: outputs = self.pipeline_engine.train_batch_iter( [default5]:frame #59: + 0x150582 (0x555acd367582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 187, in train_batch_iter [default6]:frame #63: PyObject_Call + 0xbc (0x565207768f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:frame #50: _PyEval_EvalFrameDefault + 0x2d83 (0x561787cc82b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:. This may indicate a possible application crash on rank 0 or a network set up issue. 
[default7]: output = self.forward(context=context, state=state, micro_batch=micro_batch, model=model) [default7]:frame #33: _PyEval_EvalFrameDefault + 0x13ca (0x55f4f430b8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 44, in forward [default2]:frame #37: _PyObject_Call_Prepend + 0x69 (0x5590e7c8ec39 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:frame #45: + 0x150582 (0x55b804547582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:frame #51: _PyFunction_Vectorcall + 0x6c (0x55f1e97e7a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:frame #34: + 0x150582 (0x55f4f4326582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:frame #52: _PyObject_FastCallDictTstate + 0x187 (0x55f1e97e0007 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:frame #38: + 0x211239 (0x5590e7d51239 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:frame #34: + 0x150582 (0x557b4d749582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:frame #46: PyObject_Call + 0xbc (0x55b804547f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:frame #35: _PyEval_EvalFrameDefault + 0x13ca (0x55f4f430b8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]: output = model(**micro_batch) [default2]:frame #39: _PyObject_MakeTpCall + 0x26b (0x5590e7c7da6b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:frame #51: _PyFunction_Vectorcall + 0x6c (0x561787cd5a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:frame #60: PyObject_Call + 0xbc (0x555acd367f1c in 
/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:frame #52: _PyObject_FastCallDictTstate + 0x187 (0x561787cce007 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:frame #53: _PyObject_Call_Prepend + 0x69 (0x561787cdfc39 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:frame #61: _PyEval_EvalFrameDefault + 0x2d83 (0x555acd34e2b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:frame #55: PyObject_Call + 0x207 (0x55b4a6aec067 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:frame #62: + 0x150582 (0x555acd367582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:frame #36: _PyObject_FastCallDictTstate + 0xd0 (0x55f4f4312f50 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:frame #56: _PyEval_EvalFrameDefault + 0x2d83 (0x55b4a6ad22b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl [default0]:frame #53: _PyObject_Call_Prepend + 0x69 (0x55f1e97f1c39 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:frame #63: PyObject_Call + 0xbc (0x555acd367f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:frame #54: + 0x211239 (0x561787da2239 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:frame #35: _PyEval_EvalFrameDefault + 0x13ca (0x557b4d72e8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:frame #57: + 0x150582 (0x55b4a6aeb582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:frame #47: _PyEval_EvalFrameDefault + 0x2d83 (0x55b80452e2b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) 
[default4]:frame #48: + 0x150582 (0x55b804547582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:frame #55: PyObject_Call + 0x207 (0x561787ce2067 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:frame #36: _PyObject_FastCallDictTstate + 0xd0 (0x557b4d735f50 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]: return self._call_impl(*args, **kwargs) [default2]:frame #40: _PyEval_EvalFrameDefault + 0x4eb6 (0x5590e7c793e6 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:frame #58: _PyEval_EvalFrameDefault + 0x13ca (0x55b4a6ad08fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:frame #41: _PyFunction_Vectorcall + 0x6c (0x5590e7c84a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:. This may indicate a possible application crash on rank 0 or a network set up issue. [default3]:frame #59: + 0x150582 (0x55b4a6aeb582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:frame #37: _PyObject_Call_Prepend + 0x69 (0x55f4f4324c39 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl [default7]: return forward_call(*args, **kwargs) [default0]:frame #37: _PyObject_Call_Prepend + 0x69 (0x557b4d747c39 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:frame #54: + 0x211239 (0x55f1e98b4239 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:frame #38: + 0x211239 (0x55f4f43e7239 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:frame #55: PyObject_Call + 0x207 (0x55f1e97f4067 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:frame #42: _PyEval_EvalFrameDefault + 0x72c 
(0x5590e7c74c5c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 890, in forward [default3]:frame #60: PyObject_Call + 0xbc (0x55b4a6aebf1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:frame #38: + 0x211239 (0x557b4d80a239 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:frame #56: _PyEval_EvalFrameDefault + 0x2d83 (0x55f1e97da2b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:frame #49: PyObject_Call + 0xbc (0x55b804547f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]: sharded_logits = self.model( [default2]:frame #43: _PyFunction_Vectorcall + 0x6c (0x5590e7c84a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:frame #39: _PyObject_MakeTpCall + 0x26b (0x55f4f4313a6b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:frame #61: _PyEval_EvalFrameDefault + 0x2d83 (0x55b4a6ad22b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:frame #50: _PyEval_EvalFrameDefault + 0x2d83 (0x55b80452e2b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:frame #62: + 0x150582 (0x55b4a6aeb582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:frame #40: _PyEval_EvalFrameDefault + 0x4eb6 (0x55f4f430f3e6 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:frame #63: PyObject_Call + 0xbc (0x55b4a6aebf1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:frame #56: _PyEval_EvalFrameDefault + 0x2d83 (0x561787cc82b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:frame #51: _PyFunction_Vectorcall + 0x6c (0x55b80453ba2c in 
/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:frame #57: + 0x150582 (0x561787ce1582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:frame #44: _PyEval_EvalFrameDefault + 0x13ca (0x5590e7c758fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:frame #57: + 0x150582 (0x55f1e97f3582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:frame #39: _PyObject_MakeTpCall + 0x26b (0x557b4d736a6b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:frame #52: _PyObject_FastCallDictTstate + 0x187 (0x55b804534007 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default3]:. This may indicate a possible application crash on rank 0 or a network set up issue. [default2]:frame #45: + 0x150582 (0x5590e7c90582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:frame #46: PyObject_Call + 0xbc (0x5590e7c90f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl [default7]:frame #41: _PyFunction_Vectorcall + 0x6c (0x55f4f431aa2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:frame #58: _PyEval_EvalFrameDefault + 0x13ca (0x561787cc68fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:frame #40: _PyEval_EvalFrameDefault + 0x4eb6 (0x557b4d7323e6 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:frame #41: _PyFunction_Vectorcall + 0x6c (0x557b4d73da2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]: return self._call_impl(*args, **kwargs) [default4]:frame #53: _PyObject_Call_Prepend + 0x69 (0x55b804545c39 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) 
[default0]:frame #58: _PyEval_EvalFrameDefault + 0x13ca (0x55f1e97d88fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:frame #47: _PyEval_EvalFrameDefault + 0x2d83 (0x5590e7c772b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:frame #42: _PyEval_EvalFrameDefault + 0x72c (0x557b4d72dc5c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:frame #43: _PyFunction_Vectorcall + 0x6c (0x557b4d73da2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl [default2]:frame #48: + 0x150582 (0x5590e7c90582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:frame #59: + 0x150582 (0x561787ce1582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:frame #44: _PyEval_EvalFrameDefault + 0x13ca (0x557b4d72e8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:frame #42: _PyEval_EvalFrameDefault + 0x72c (0x55f4f430ac5c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]: return forward_call(*args, **kwargs) [default7]:frame #43: _PyFunction_Vectorcall + 0x6c (0x55f4f431aa2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:frame #59: + 0x150582 (0x55f1e97f3582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:frame #54: + 0x211239 (0x55b804608239 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 764, in forward [default4]:frame #55: PyObject_Call + 0x207 (0x55b804548067 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:frame #56: _PyEval_EvalFrameDefault + 0x2d83 (0x55b80452e2b3 
in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:frame #60: PyObject_Call + 0xbc (0x561787ce1f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:frame #45: + 0x150582 (0x557b4d749582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]: return self.forward_with_hidden_states(input_ids=input_ids, input_mask=input_mask)[0] [default2]:frame #49: PyObject_Call + 0xbc (0x5590e7c90f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:frame #44: _PyEval_EvalFrameDefault + 0x13ca (0x55f4f430b8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:frame #60: PyObject_Call + 0xbc (0x55f1e97f3f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:frame #57: + 0x150582 (0x55b804547582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:frame #61: _PyEval_EvalFrameDefault + 0x2d83 (0x561787cc82b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 780, in forward_with_hidden_states [default4]:frame #58: _PyEval_EvalFrameDefault + 0x13ca (0x55b80452c8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:frame #59: + 0x150582 (0x55b804547582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:frame #62: + 0x150582 (0x561787ce1582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:frame #45: + 0x150582 (0x55f4f4326582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:frame #61: _PyEval_EvalFrameDefault + 0x2d83 (0x55f1e97da2b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:frame #62: + 0x150582 (0x55f1e97f3582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) 
[default0]:frame #63: PyObject_Call + 0xbc (0x55f1e97f3f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:frame #46: PyObject_Call + 0xbc (0x557b4d749f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:frame #63: PyObject_Call + 0xbc (0x561787ce1f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:frame #50: _PyEval_EvalFrameDefault + 0x2d83 (0x5590e7c772b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:. This may indicate a possible application crash on rank 0 or a network set up issue. [default0]:frame #47: _PyEval_EvalFrameDefault + 0x2d83 (0x557b4d7302b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]: hidden_encoder_states = encoder_block(**hidden_encoder_states) [default7]:frame #46: PyObject_Call + 0xbc (0x55f4f4326f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl [default4]:frame #60: PyObject_Call + 0xbc (0x55b804547f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]: return self._call_impl(*args, **kwargs) [default4]:frame #61: _PyEval_EvalFrameDefault + 0x2d83 (0x55b80452e2b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:frame #62: + 0x150582 (0x55b804547582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl [default0]:frame #48: + 0x150582 (0x557b4d749582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:frame #49: PyObject_Call + 0xbc (0x557b4d749f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]: 
return forward_call(*args, **kwargs) [default4]:frame #63: PyObject_Call + 0xbc (0x55b804547f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/block.py", line 126, in forward [default7]: new_kwargs[name] = recv_from_pipeline_state_buffer( [default2]:frame #51: _PyFunction_Vectorcall + 0x6c (0x5590e7c84a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:frame #52: _PyObject_FastCallDictTstate + 0x187 (0x5590e7c7d007 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:. This may indicate a possible application crash on rank 0 or a network set up issue. [default4]:. This may indicate a possible application crash on rank 0 or a network set up issue. [default7]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/functional.py", line 117, in recv_from_pipeline_state_buffer [default7]: pipeline_state.run_communication() [default7]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 150, in run_communication [default0]:frame #50: _PyEval_EvalFrameDefault + 0x2d83 (0x557b4d7302b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]: recv_activation_tensor = recv_activation() [default7]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 31, in __call__ [default2]:frame #53: _PyObject_Call_Prepend + 0x69 (0x5590e7c8ec39 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]: return self.p2p.recv_tensors(num_tensors=1, from_rank=self.from_rank)[0] [default7]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 353, in recv_tensors [default0]:frame #51: _PyFunction_Vectorcall + 0x6c 
(0x557b4d73da2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]: buffers, futures = self.irecv_tensors(num_tensors=num_tensors, from_rank=from_rank, tag=tag) [default7]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 326, in irecv_tensors [default7]: meta = self._recv_meta(from_rank=from_rank, tag=tag) [default2]:frame #54: + 0x211239 (0x5590e7d51239 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 246, in _recv_meta [default7]: dist.recv( [default7]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 72, in wrapper [default0]:frame #52: _PyObject_FastCallDictTstate + 0x187 (0x557b4d736007 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]: return func(*args, **kwargs) [default7]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1706, in recv [default7]: pg.recv([tensor], group_src_rank, tag).wait() [default2]:frame #55: PyObject_Call + 0x207 (0x5590e7c91067 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:frame #56: _PyEval_EvalFrameDefault + 0x2d83 (0x5590e7c772b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:torch.distributed.DistBackendError: [9] is setting up NCCL communicator and retrieving ncclUniqueId from [0] via c10d key-value store by key '8:9', but store->get('8:9') got error: Connection reset by peer [default7]:Exception raised from recvBytes at ../torch/csrc/distributed/c10d/Utils.hpp:670 (most recent call first): [default0]:frame #53: _PyObject_Call_Prepend + 0x69 (0x557b4d747c39 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) 
[default7]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7faa16c1bd87 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default7]:frame #1: + 0x589518e (0x7faa4ebd518e in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default0]:frame #54: + 0x211239 (0x557b4d80a239 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:frame #57: + 0x150582 (0x5590e7c90582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:frame #2: c10d::TCPStore::doWait(c10::ArrayRef, std::chrono::duration >) + 0x360 (0x7faa4ebcf9a0 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default2]:frame #58: _PyEval_EvalFrameDefault + 0x13ca (0x5590e7c758fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:frame #3: c10d::TCPStore::doGet(std::string const&) + 0x32 (0x7faa4ebcfce2 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default7]:frame #4: c10d::TCPStore::get(std::string const&) + 0xa1 (0x7faa4ebd0b11 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default7]:frame #5: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7faa4eb85f81 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default7]:frame #6: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7faa4eb85f81 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default7]:frame #47: _PyEval_EvalFrameDefault + 0x2d83 (0x55f4f430d2b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:frame #7: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7faa4eb85f81 in 
/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default7]:frame #8: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7faa4eb85f81 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default7]:frame #48: + 0x150582 (0x55f4f4326582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:frame #9: c10d::ProcessGroupNCCL::broadcastUniqueNCCLID(ncclUniqueId*, bool, std::string const&, int) + 0xa9 (0x7faa17dc3c69 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default7]:frame #10: c10d::ProcessGroupNCCL::getNCCLComm(std::string const&, std::vector > const&, c10d::OpType, int, bool) + 0x22b (0x7faa17dcac5b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default7]:frame #11: c10d::ProcessGroupNCCL::recv(std::vector >&, int, int) + 0x550 (0x7faa17dedb60 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default0]:frame #55: PyObject_Call + 0x207 (0x557b4d74a067 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:frame #59: + 0x150582 (0x5590e7c90582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:frame #12: + 0x5838439 (0x7faa4eb78439 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default7]:frame #13: + 0x5843330 (0x7faa4eb83330 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default7]:frame #14: + 0x58433c5 (0x7faa4eb833c5 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default7]:frame #49: PyObject_Call + 0xbc (0x55f4f4326f1c in 
/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:frame #15: + 0x4e893cc (0x7faa4e1c93cc in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default7]:frame #16: + 0x1a08a88 (0x7faa4ad48a88 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default7]:frame #17: + 0x5849a84 (0x7faa4eb89a84 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default7]:frame #50: _PyEval_EvalFrameDefault + 0x2d83 (0x55f4f430d2b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:frame #18: + 0x584ed35 (0x7faa4eb8ed35 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default7]:frame #19: + 0xc97eee (0x7faa61440eee in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_python.so) [default7]:frame #20: + 0x413ea4 (0x7faa60bbcea4 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_python.so) [default2]:frame #60: PyObject_Call + 0xbc (0x5590e7c90f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:frame #21: + 0x1445a6 (0x5639130dc5a6 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:frame #22: _PyObject_MakeTpCall + 0x26b (0x5639130d5a6b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:frame #23: + 0x150866 (0x5639130e8866 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:frame #51: _PyFunction_Vectorcall + 0x6c (0x55f4f431aa2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:frame #24: _PyEval_EvalFrameDefault + 0x4c12 (0x5639130d1142 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:frame #25: 
_PyFunction_Vectorcall + 0x6c (0x5639130dca2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:frame #26: PyObject_Call + 0xbc (0x5639130e8f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:frame #61: _PyEval_EvalFrameDefault + 0x2d83 (0x5590e7c772b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:frame #27: _PyEval_EvalFrameDefault + 0x2d83 (0x5639130cf2b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:frame #28: _PyFunction_Vectorcall + 0x6c (0x5639130dca2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:frame #29: _PyEval_EvalFrameDefault + 0x13ca (0x5639130cd8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:frame #30: + 0x150582 (0x5639130e8582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:frame #52: _PyObject_FastCallDictTstate + 0x187 (0x55f4f4313007 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:frame #31: _PyEval_EvalFrameDefault + 0x13ca (0x5639130cd8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:frame #32: + 0x150582 (0x5639130e8582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:frame #33: _PyEval_EvalFrameDefault + 0x13ca (0x5639130cd8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:frame #56: _PyEval_EvalFrameDefault + 0x2d83 (0x557b4d7302b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:frame #34: + 0x150582 (0x5639130e8582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:frame #35: _PyEval_EvalFrameDefault + 0x13ca (0x5639130cd8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:frame #36: _PyObject_FastCallDictTstate + 0xd0 (0x5639130d4f50 in 
/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:frame #37: _PyObject_Call_Prepend + 0x69 (0x5639130e6c39 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:frame #53: _PyObject_Call_Prepend + 0x69 (0x55f4f4324c39 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:frame #38: + 0x211239 (0x5639131a9239 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:frame #39: _PyObject_MakeTpCall + 0x26b (0x5639130d5a6b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:frame #40: _PyEval_EvalFrameDefault + 0x4eb6 (0x5639130d13e6 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:frame #41: _PyFunction_Vectorcall + 0x6c (0x5639130dca2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:frame #62: + 0x150582 (0x5590e7c90582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:frame #42: _PyEval_EvalFrameDefault + 0x72c (0x5639130ccc5c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:frame #43: _PyFunction_Vectorcall + 0x6c (0x5639130dca2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:frame #44: _PyEval_EvalFrameDefault + 0x13ca (0x5639130cd8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:frame #54: + 0x211239 (0x55f4f43e7239 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:frame #45: + 0x150582 (0x5639130e8582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:frame #46: PyObject_Call + 0xbc (0x5639130e8f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:frame #47: _PyEval_EvalFrameDefault + 0x2d83 (0x5639130cf2b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:frame #63: PyObject_Call + 0xbc 
(0x5590e7c90f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:frame #48: + 0x150582 (0x5639130e8582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:frame #49: PyObject_Call + 0xbc (0x5639130e8f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:frame #50: _PyEval_EvalFrameDefault + 0x2d83 (0x5639130cf2b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:frame #51: _PyFunction_Vectorcall + 0x6c (0x5639130dca2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default2]:. This may indicate a possible application crash on rank 0 or a network set up issue. [default7]:frame #52: _PyObject_FastCallDictTstate + 0x187 (0x5639130d5007 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:frame #53: _PyObject_Call_Prepend + 0x69 (0x5639130e6c39 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:frame #54: + 0x211239 (0x5639131a9239 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:frame #55: PyObject_Call + 0x207 (0x5639130e9067 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:frame #55: PyObject_Call + 0x207 (0x55f4f4327067 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:frame #56: _PyEval_EvalFrameDefault + 0x2d83 (0x5639130cf2b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:frame #57: + 0x150582 (0x5639130e8582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:frame #58: _PyEval_EvalFrameDefault + 0x13ca (0x5639130cd8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:frame #57: + 0x150582 (0x557b4d749582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:frame #59: + 0x150582 (0x5639130e8582 in 
/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:frame #60: PyObject_Call + 0xbc (0x5639130e8f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:frame #61: _PyEval_EvalFrameDefault + 0x2d83 (0x5639130cf2b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:frame #58: _PyEval_EvalFrameDefault + 0x13ca (0x557b4d72e8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:frame #62: + 0x150582 (0x5639130e8582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:frame #63: PyObject_Call + 0xbc (0x5639130e8f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:. This may indicate a possible application crash on rank 0 or a network set up issue. [default0]:frame #59: + 0x150582 (0x557b4d749582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:frame #60: PyObject_Call + 0xbc (0x557b4d749f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:frame #13: + 0x5843330 (0x7fcbeb4d4330 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default4]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7fe807993d87 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default0]:frame #61: _PyEval_EvalFrameDefault + 0x2d83 (0x557b4d7302b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:frame #62: + 0x150582 (0x557b4d749582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:frame #14: + 0x58433c5 (0x7fcbeb4d43c5 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default0]:frame #7: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7facddc65f81 in 
/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default7]:frame #56: _PyEval_EvalFrameDefault + 0x2d83 (0x55f4f430d2b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:frame #15: + 0x4e893cc (0x7fcbeab1a3cc in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default7]:frame #57: + 0x150582 (0x55f4f4326582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:frame #58: _PyEval_EvalFrameDefault + 0x13ca (0x55f4f430b8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:frame #8: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7facddc65f81 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default7]:frame #59: + 0x150582 (0x55f4f4326582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:frame #16: + 0x1a08a88 (0x7fcbe7699a88 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default7]:frame #60: PyObject_Call + 0xbc (0x55f4f4326f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:frame #1: + 0x589518e (0x7fe83f94d18e in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default0]:frame #63: PyObject_Call + 0xbc (0x557b4d749f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:frame #9: c10d::ProcessGroupNCCL::broadcastUniqueNCCLID(ncclUniqueId*, bool, std::string const&, int) + 0xa9 (0x7faca6ea3c69 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default7]:frame #61: _PyEval_EvalFrameDefault + 0x2d83 (0x55f4f430d2b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:frame #62: + 0x150582 
(0x55f4f4326582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:frame #17: + 0x5849a84 (0x7fcbeb4daa84 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default7]:frame #63: PyObject_Call + 0xbc (0x55f4f4326f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:. This may indicate a possible application crash on rank 0 or a network set up issue. [default0]:frame #10: c10d::ProcessGroupNCCL::getNCCLComm(std::string const&, std::vector > const&, c10d::OpType, int, bool) + 0x22b (0x7faca6eaac5b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default7]:. This may indicate a possible application crash on rank 0 or a network set up issue. [default4]:frame #2: c10d::TCPStore::doWait(c10::ArrayRef, std::chrono::duration >) + 0x360 (0x7fe83f9479a0 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default4]:frame #3: c10d::TCPStore::doGet(std::string const&) + 0x32 (0x7fe83f947ce2 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default4]:frame #4: c10d::TCPStore::get(std::string const&) + 0xa1 (0x7fe83f948b11 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default4]:frame #5: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7fe83f8fdf81 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default4]:frame #6: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7fe83f8fdf81 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default5]:frame #18: + 0x584ed35 (0x7fcbeb4dfd35 in 
/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default4]:frame #7: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7fe83f8fdf81 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default0]:frame #11: c10d::ProcessGroupNCCL::recv(std::vector >&, int, int) + 0x550 (0x7faca6ecdb60 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default4]:frame #8: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7fe83f8fdf81 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default5]:frame #19: + 0xc97eee (0x7fcbfdd91eee in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_python.so) [default4]:frame #9: c10d::ProcessGroupNCCL::broadcastUniqueNCCLID(ncclUniqueId*, bool, std::string const&, int) + 0xa9 (0x7fe808b3bc69 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default5]:frame #20: + 0x413ea4 (0x7fcbfd50dea4 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_python.so) [default0]:frame #12: + 0x5838439 (0x7facddc58439 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default4]:frame #10: c10d::ProcessGroupNCCL::getNCCLComm(std::string const&, std::vector > const&, c10d::OpType, int, bool) + 0x22b (0x7fe808b42c5b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default5]:frame #21: + 0x1445a6 (0x55ec91a4a5a6 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:frame #13: + 0x5843330 (0x7facddc63330 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) 
[default4]:frame #11: c10d::ProcessGroupNCCL::recv(std::vector >&, int, int) + 0x550 (0x7fe808b65b60 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default0]:frame #14: + 0x58433c5 (0x7facddc633c5 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default5]:frame #22: _PyObject_MakeTpCall + 0x26b (0x55ec91a43a6b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:frame #12: + 0x5838439 (0x7fe83f8f0439 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default0]:frame #15: + 0x4e893cc (0x7facdd2a93cc in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default5]:frame #23: + 0x150866 (0x55ec91a56866 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:frame #13: + 0x5843330 (0x7fe83f8fb330 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default0]:frame #16: + 0x1a08a88 (0x7facd9e28a88 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default4]:frame #14: + 0x58433c5 (0x7fe83f8fb3c5 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default5]:frame #24: _PyEval_EvalFrameDefault + 0x4c12 (0x55ec91a3f142 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:frame #15: + 0x4e893cc (0x7fe83ef413cc in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default5]:frame #25: _PyFunction_Vectorcall + 0x6c (0x55ec91a4aa2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:frame #26: PyObject_Call + 0xbc (0x55ec91a56f1c in 
/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:frame #16: + 0x1a08a88 (0x7fe83bac0a88 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default4]:frame #17: + 0x5849a84 (0x7fe83f901a84 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default5]:frame #27: _PyEval_EvalFrameDefault + 0x2d83 (0x55ec91a3d2b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:frame #28: _PyFunction_Vectorcall + 0x6c (0x55ec91a4aa2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:frame #17: + 0x5849a84 (0x7facddc69a84 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default5]:frame #29: _PyEval_EvalFrameDefault + 0x13ca (0x55ec91a3b8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:frame #30: + 0x150582 (0x55ec91a56582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:frame #31: _PyEval_EvalFrameDefault + 0x13ca (0x55ec91a3b8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:frame #32: + 0x150582 (0x55ec91a56582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:frame #33: _PyEval_EvalFrameDefault + 0x13ca (0x55ec91a3b8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:frame #34: + 0x150582 (0x55ec91a56582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:frame #35: _PyEval_EvalFrameDefault + 0x13ca (0x55ec91a3b8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:frame #18: + 0x584ed35 (0x7fe83f906d35 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default4]:frame #19: + 0xc97eee (0x7fe8521b8eee in 
/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_python.so) [default4]:frame #20: + 0x413ea4 (0x7fe851934ea4 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_python.so) [default4]:frame #21: + 0x1445a6 (0x55b25e3495a6 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:frame #18: + 0x584ed35 (0x7facddc6ed35 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default0]:frame #19: + 0xc97eee (0x7facf0520eee in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_python.so) [default4]:frame #22: _PyObject_MakeTpCall + 0x26b (0x55b25e342a6b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:frame #36: _PyObject_FastCallDictTstate + 0xd0 (0x55ec91a42f50 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:frame #20: + 0x413ea4 (0x7facefc9cea4 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_python.so) [default5]:frame #37: _PyObject_Call_Prepend + 0x69 (0x55ec91a54c39 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:frame #21: + 0x1445a6 (0x5584db1695a6 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:frame #23: + 0x150866 (0x55b25e355866 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:frame #38: + 0x211239 (0x55ec91b17239 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:frame #24: _PyEval_EvalFrameDefault + 0x4c12 (0x55b25e33e142 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:frame #22: _PyObject_MakeTpCall + 0x26b (0x5584db162a6b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:frame #39: _PyObject_MakeTpCall + 0x26b 
(0x55ec91a43a6b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:frame #40: _PyEval_EvalFrameDefault + 0x4eb6 (0x55ec91a3f3e6 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:frame #23: + 0x150866 (0x5584db175866 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:frame #25: _PyFunction_Vectorcall + 0x6c (0x55b25e349a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:frame #24: _PyEval_EvalFrameDefault + 0x4c12 (0x5584db15e142 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:frame #26: PyObject_Call + 0xbc (0x55b25e355f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:frame #27: _PyEval_EvalFrameDefault + 0x2d83 (0x55b25e33c2b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:frame #25: _PyFunction_Vectorcall + 0x6c (0x5584db169a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:frame #26: PyObject_Call + 0xbc (0x5584db175f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:frame #27: _PyEval_EvalFrameDefault + 0x2d83 (0x5584db15c2b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:frame #41: _PyFunction_Vectorcall + 0x6c (0x55ec91a4aa2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:frame #28: _PyFunction_Vectorcall + 0x6c (0x55b25e349a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:frame #28: _PyFunction_Vectorcall + 0x6c (0x5584db169a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:frame #42: _PyEval_EvalFrameDefault + 0x72c (0x55ec91a3ac5c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:frame #29: _PyEval_EvalFrameDefault + 0x13ca (0x5584db15a8fa in 
/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:frame #43: _PyFunction_Vectorcall + 0x6c (0x55ec91a4aa2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:frame #30: + 0x150582 (0x5584db175582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:frame #29: _PyEval_EvalFrameDefault + 0x13ca (0x55b25e33a8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:frame #44: _PyEval_EvalFrameDefault + 0x13ca (0x55ec91a3b8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:frame #30: + 0x150582 (0x55b25e355582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:frame #31: _PyEval_EvalFrameDefault + 0x13ca (0x5584db15a8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:frame #31: _PyEval_EvalFrameDefault + 0x13ca (0x55b25e33a8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:frame #45: + 0x150582 (0x55ec91a56582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:frame #32: + 0x150582 (0x5584db175582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:frame #32: + 0x150582 (0x55b25e355582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:frame #33: _PyEval_EvalFrameDefault + 0x13ca (0x5584db15a8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:frame #46: PyObject_Call + 0xbc (0x55ec91a56f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:frame #47: _PyEval_EvalFrameDefault + 0x2d83 (0x55ec91a3d2b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:frame #34: + 0x150582 (0x5584db175582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:frame #35: _PyEval_EvalFrameDefault + 0x13ca (0x5584db15a8fa in 
/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:frame #33: _PyEval_EvalFrameDefault + 0x13ca (0x55b25e33a8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:frame #36: _PyObject_FastCallDictTstate + 0xd0 (0x5584db161f50 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:frame #34: + 0x150582 (0x55b25e355582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:frame #37: _PyObject_Call_Prepend + 0x69 (0x5584db173c39 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:frame #48: + 0x150582 (0x55ec91a56582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:frame #35: _PyEval_EvalFrameDefault + 0x13ca (0x55b25e33a8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:frame #49: PyObject_Call + 0xbc (0x55ec91a56f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:frame #36: _PyObject_FastCallDictTstate + 0xd0 (0x55b25e341f50 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:frame #37: _PyObject_Call_Prepend + 0x69 (0x55b25e353c39 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:frame #50: _PyEval_EvalFrameDefault + 0x2d83 (0x55ec91a3d2b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:frame #38: + 0x211239 (0x5584db236239 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:frame #38: + 0x211239 (0x55b25e416239 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:frame #51: _PyFunction_Vectorcall + 0x6c (0x55ec91a4aa2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:frame #39: _PyObject_MakeTpCall + 0x26b (0x55b25e342a6b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:frame #39: _PyObject_MakeTpCall + 
0x26b (0x5584db162a6b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:frame #40: _PyEval_EvalFrameDefault + 0x4eb6 (0x55b25e33e3e6 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:frame #52: _PyObject_FastCallDictTstate + 0x187 (0x55ec91a43007 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:frame #40: _PyEval_EvalFrameDefault + 0x4eb6 (0x5584db15e3e6 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:frame #41: _PyFunction_Vectorcall + 0x6c (0x55b25e349a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:frame #53: _PyObject_Call_Prepend + 0x69 (0x55ec91a54c39 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:frame #41: _PyFunction_Vectorcall + 0x6c (0x5584db169a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:frame #42: _PyEval_EvalFrameDefault + 0x72c (0x55b25e339c5c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:frame #54: + 0x211239 (0x55ec91b17239 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:frame #42: _PyEval_EvalFrameDefault + 0x72c (0x5584db159c5c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:frame #43: _PyFunction_Vectorcall + 0x6c (0x5584db169a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:frame #55: PyObject_Call + 0x207 (0x55ec91a57067 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:frame #43: _PyFunction_Vectorcall + 0x6c (0x55b25e349a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:frame #44: _PyEval_EvalFrameDefault + 0x13ca (0x55b25e33a8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:frame #44: _PyEval_EvalFrameDefault + 0x13ca (0x5584db15a8fa in 
/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:frame #45: + 0x150582 (0x55b25e355582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:frame #56: _PyEval_EvalFrameDefault + 0x2d83 (0x55ec91a3d2b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:frame #45: + 0x150582 (0x5584db175582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:frame #46: PyObject_Call + 0xbc (0x55b25e355f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:frame #47: _PyEval_EvalFrameDefault + 0x2d83 (0x55b25e33c2b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:frame #46: PyObject_Call + 0xbc (0x5584db175f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:frame #57: + 0x150582 (0x55ec91a56582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:frame #48: + 0x150582 (0x55b25e355582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:frame #47: _PyEval_EvalFrameDefault + 0x2d83 (0x5584db15c2b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:frame #49: PyObject_Call + 0xbc (0x55b25e355f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:frame #48: + 0x150582 (0x5584db175582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:frame #58: _PyEval_EvalFrameDefault + 0x13ca (0x55ec91a3b8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:frame #50: _PyEval_EvalFrameDefault + 0x2d83 (0x55b25e33c2b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:frame #49: PyObject_Call + 0xbc (0x5584db175f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:frame #51: _PyFunction_Vectorcall + 0x6c (0x55b25e349a2c in 
/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:frame #59: + 0x150582 (0x55ec91a56582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:frame #50: _PyEval_EvalFrameDefault + 0x2d83 (0x5584db15c2b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:frame #60: PyObject_Call + 0xbc (0x55ec91a56f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:frame #51: _PyFunction_Vectorcall + 0x6c (0x5584db169a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:frame #61: _PyEval_EvalFrameDefault + 0x2d83 (0x55ec91a3d2b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:frame #52: _PyObject_FastCallDictTstate + 0x187 (0x55b25e342007 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:frame #62: + 0x150582 (0x55ec91a56582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:frame #52: _PyObject_FastCallDictTstate + 0x187 (0x5584db162007 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:frame #53: _PyObject_Call_Prepend + 0x69 (0x55b25e353c39 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:frame #53: _PyObject_Call_Prepend + 0x69 (0x5584db173c39 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:frame #54: + 0x211239 (0x55b25e416239 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:frame #54: + 0x211239 (0x5584db236239 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:frame #63: PyObject_Call + 0xbc (0x55ec91a56f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:frame #55: PyObject_Call + 0x207 (0x5584db176067 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default5]:. 
This may indicate a possible application crash on rank 0 or a network set up issue. [default4]:frame #55: PyObject_Call + 0x207 (0x55b25e356067 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:frame #56: _PyEval_EvalFrameDefault + 0x2d83 (0x5584db15c2b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:frame #56: _PyEval_EvalFrameDefault + 0x2d83 (0x55b25e33c2b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:frame #57: + 0x150582 (0x55b25e355582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:frame #57: + 0x150582 (0x5584db175582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:frame #58: _PyEval_EvalFrameDefault + 0x13ca (0x55b25e33a8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:frame #59: + 0x150582 (0x55b25e355582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:frame #60: PyObject_Call + 0xbc (0x55b25e355f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:frame #61: _PyEval_EvalFrameDefault + 0x2d83 (0x55b25e33c2b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:frame #58: _PyEval_EvalFrameDefault + 0x13ca (0x5584db15a8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:frame #62: + 0x150582 (0x55b25e355582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:frame #63: PyObject_Call + 0xbc (0x55b25e355f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default0]:frame #59: + 0x150582 (0x5584db175582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default4]:. This may indicate a possible application crash on rank 0 or a network set up issue. 
[default0]:frame #60: PyObject_Call + 0xbc (0x5584db175f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10)
[default0]:frame #61: _PyEval_EvalFrameDefault + 0x2d83 (0x5584db15c2b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10)
[default0]:frame #62: + 0x150582 (0x5584db175582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10)
[default0]:frame #63: PyObject_Call + 0xbc (0x5584db175f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10)
[default0]:. This may indicate a possible application crash on rank 0 or a network set up issue.
[default7]:Traceback (most recent call last):
[default7]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 237, in <module>
[default7]: trainer.train(dataloader)
[default7]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 430, in train
[default7]: outputs, loss_avg = self.training_step(dataloader=self.current_dataloader)
[default7]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 459, in training_step
[default7]: outputs = self.pipeline_engine.train_batch_iter(
[default7]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 187, in train_batch_iter
[default7]: output = self.forward(context=context, state=state, micro_batch=micro_batch, model=model)
[default7]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 44, in forward
[default7]: output = model(**micro_batch)
[default7]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
[default7]: return self._call_impl(*args, **kwargs)
[default7]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
[default7]: return forward_call(*args, **kwargs)
[default7]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 890, in forward
[default7]: sharded_logits = self.model(
[default7]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
[default7]: return self._call_impl(*args, **kwargs)
[default7]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
[default7]: return forward_call(*args, **kwargs)
[default7]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 764, in forward
[default7]: return self.forward_with_hidden_states(input_ids=input_ids, input_mask=input_mask)[0]
[default7]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 780, in forward_with_hidden_states
[default7]: hidden_encoder_states = encoder_block(**hidden_encoder_states)
[default7]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
[default7]: return self._call_impl(*args, **kwargs)
[default7]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
[default7]: return forward_call(*args, **kwargs)
[default7]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/block.py", line 126, in forward
[default7]: new_kwargs[name] = recv_from_pipeline_state_buffer(
[default7]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/functional.py", line 117, in recv_from_pipeline_state_buffer
[default7]: pipeline_state.run_communication()
[default7]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 150, in run_communication
[default7]: recv_activation_tensor = recv_activation()
[default7]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 31, in __call__
[default7]: return self.p2p.recv_tensors(num_tensors=1, from_rank=self.from_rank)[0]
[default7]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 353, in recv_tensors
[default7]: buffers, futures = self.irecv_tensors(num_tensors=num_tensors, from_rank=from_rank, tag=tag)
[default7]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 326, in irecv_tensors
[default7]: meta = self._recv_meta(from_rank=from_rank, tag=tag)
[default7]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 246, in _recv_meta
[default7]: dist.recv(
[default7]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 72, in wrapper
[default7]: return func(*args, **kwargs)
[default7]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1706, in recv
[default7]: pg.recv([tensor], group_src_rank, tag).wait()
[default7]:torch.distributed.DistBackendError: [3] is setting up NCCL communicator and retrieving ncclUniqueId from [0] via c10d key-value store by key '2:3', but store->get('2:3') got error: Connection reset by peer
[default7]:Exception raised from recvBytes at ../torch/csrc/distributed/c10d/Utils.hpp:670 (most recent call first):
[default7]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7fdc8fec3d87 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so)
[default7]:frame #1: + 0x589518e (0x7fdcc7e7d18e in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default7]:frame #2: c10d::TCPStore::doWait(c10::ArrayRef, std::chrono::duration >) + 0x360 (0x7fdcc7e779a0 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default7]:frame #3: c10d::TCPStore::doGet(std::string const&) + 0x32 (0x7fdcc7e77ce2 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default7]:frame #4: c10d::TCPStore::get(std::string const&) + 0xa1 (0x7fdcc7e78b11 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default7]:frame #5: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7fdcc7e2df81 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default7]:frame #6: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7fdcc7e2df81 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default7]:frame #7: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7fdcc7e2df81 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default7]:frame #8: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7fdcc7e2df81 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default7]:frame #9: c10d::ProcessGroupNCCL::broadcastUniqueNCCLID(ncclUniqueId*, bool, std::string const&, int) + 0xa9 (0x7fdc9106bc69 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default7]:frame #10: c10d::ProcessGroupNCCL::getNCCLComm(std::string const&, std::vector > const&, c10d::OpType, int, bool) + 0x22b (0x7fdc91072c5b in 
/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default7]:frame #11: c10d::ProcessGroupNCCL::recv(std::vector >&, int, int) + 0x550 (0x7fdc91095b60 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default7]:frame #12: + 0x5838439 (0x7fdcc7e20439 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default7]:frame #13: + 0x5843330 (0x7fdcc7e2b330 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default7]:frame #14: + 0x58433c5 (0x7fdcc7e2b3c5 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default7]:frame #15: + 0x4e893cc (0x7fdcc74713cc in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default7]:frame #16: + 0x1a08a88 (0x7fdcc3ff0a88 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default7]:frame #17: + 0x5849a84 (0x7fdcc7e31a84 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default6]:Traceback (most recent call last): [default6]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 237, in [default7]:frame #18: + 0x584ed35 (0x7fdcc7e36d35 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default6]: trainer.train(dataloader) [default6]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 430, in train [default6]: outputs, loss_avg = self.training_step(dataloader=self.current_dataloader) [default6]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 459, in training_step [default6]: outputs = 
self.pipeline_engine.train_batch_iter( [default7]:frame #19: + 0xc97eee (0x7fdcda6e8eee in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_python.so) [default7]:frame #20: + 0x413ea4 (0x7fdcd9e64ea4 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_python.so) [default7]:frame #21: + 0x1445a6 (0x5603d280c5a6 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:frame #22: _PyObject_MakeTpCall + 0x26b (0x5603d2805a6b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:frame #23: + 0x150866 (0x5603d2818866 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:frame #24: _PyEval_EvalFrameDefault + 0x4c12 (0x5603d2801142 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:frame #25: _PyFunction_Vectorcall + 0x6c (0x5603d280ca2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:frame #26: PyObject_Call + 0xbc (0x5603d2818f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:frame #27: _PyEval_EvalFrameDefault + 0x2d83 (0x5603d27ff2b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:frame #28: _PyFunction_Vectorcall + 0x6c (0x5603d280ca2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:frame #29: _PyEval_EvalFrameDefault + 0x13ca (0x5603d27fd8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:frame #30: + 0x150582 (0x5603d2818582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:frame #31: _PyEval_EvalFrameDefault + 0x13ca (0x5603d27fd8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:frame #32: + 0x150582 (0x5603d2818582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:frame #33: 
_PyEval_EvalFrameDefault + 0x13ca (0x5603d27fd8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:frame #34: + 0x150582 (0x5603d2818582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:frame #35: _PyEval_EvalFrameDefault + 0x13ca (0x5603d27fd8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 187, in train_batch_iter [default6]: output = self.forward(context=context, state=state, micro_batch=micro_batch, model=model) [default7]:frame #36: _PyObject_FastCallDictTstate + 0xd0 (0x5603d2804f50 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:frame #37: _PyObject_Call_Prepend + 0x69 (0x5603d2816c39 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:frame #38: + 0x211239 (0x5603d28d9239 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:frame #39: _PyObject_MakeTpCall + 0x26b (0x5603d2805a6b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:frame #40: _PyEval_EvalFrameDefault + 0x4eb6 (0x5603d28013e6 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:frame #41: _PyFunction_Vectorcall + 0x6c (0x5603d280ca2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:frame #42: _PyEval_EvalFrameDefault + 0x72c (0x5603d27fcc5c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:frame #43: _PyFunction_Vectorcall + 0x6c (0x5603d280ca2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:frame #44: _PyEval_EvalFrameDefault + 0x13ca (0x5603d27fd8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:frame #45: + 0x150582 (0x5603d2818582 in 
/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/engine.py", line 44, in forward [default6]: output = model(**micro_batch) [default6]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl [default6]: return self._call_impl(*args, **kwargs) [default6]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl [default7]:frame #46: PyObject_Call + 0xbc (0x5603d2818f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]: return forward_call(*args, **kwargs) [default6]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 890, in forward [default7]:frame #47: _PyEval_EvalFrameDefault + 0x2d83 (0x5603d27ff2b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:frame #48: + 0x150582 (0x5603d2818582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:frame #49: PyObject_Call + 0xbc (0x5603d2818f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]: sharded_logits = self.model( [default6]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl [default6]: return self._call_impl(*args, **kwargs) [default7]:frame #50: _PyEval_EvalFrameDefault + 0x2d83 (0x5603d27ff2b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:frame #51: _PyFunction_Vectorcall + 0x6c (0x5603d280ca2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:frame #52: _PyObject_FastCallDictTstate + 0x187 (0x5603d2805007 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) 
[default6]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl [default6]: return forward_call(*args, **kwargs) [default6]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 764, in forward [default6]: return self.forward_with_hidden_states(input_ids=input_ids, input_mask=input_mask)[0] [default7]:frame #53: _PyObject_Call_Prepend + 0x69 (0x5603d2816c39 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:frame #54: + 0x211239 (0x5603d28d9239 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/models/llama.py", line 780, in forward_with_hidden_states [default7]:frame #55: PyObject_Call + 0x207 (0x5603d2819067 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:frame #56: _PyEval_EvalFrameDefault + 0x2d83 (0x5603d27ff2b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:frame #57: + 0x150582 (0x5603d2818582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:frame #58: _PyEval_EvalFrameDefault + 0x13ca (0x5603d27fd8fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:frame #59: + 0x150582 (0x5603d2818582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:frame #60: PyObject_Call + 0xbc (0x5603d2818f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:frame #61: _PyEval_EvalFrameDefault + 0x2d83 (0x5603d27ff2b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:frame #62: + 0x150582 (0x5603d2818582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]: hidden_encoder_states = encoder_block(**hidden_encoder_states) [default6]: File 
"/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl [default6]: return self._call_impl(*args, **kwargs) [default6]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl [default6]: return forward_call(*args, **kwargs) [default6]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/block.py", line 126, in forward [default7]:frame #63: PyObject_Call + 0xbc (0x5603d2818f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default7]:. This may indicate a possible application crash on rank 0 or a network set up issue. [default6]: new_kwargs[name] = recv_from_pipeline_state_buffer( [default6]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/functional.py", line 117, in recv_from_pipeline_state_buffer [default6]: pipeline_state.run_communication() [default6]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 150, in run_communication [default6]: recv_activation_tensor = recv_activation() [default6]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/state.py", line 31, in __call__ [default6]: return self.p2p.recv_tensors(num_tensors=1, from_rank=self.from_rank)[0] [default6]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 353, in recv_tensors [default6]: buffers, futures = self.irecv_tensors(num_tensors=num_tensors, from_rank=from_rank, tag=tag) [default6]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 326, in irecv_tensors [default6]: meta = self._recv_meta(from_rank=from_rank, tag=tag) [default6]: File 
"/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/pipeline_parallel/p2p.py", line 246, in _recv_meta [default6]: dist.recv( [default6]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 72, in wrapper [default6]: return func(*args, **kwargs) [default6]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1706, in recv [default6]: pg.recv([tensor], group_src_rank, tag).wait() [default6]:torch.distributed.DistBackendError: [3] is setting up NCCL communicator and retrieving ncclUniqueId from [0] via c10d key-value store by key '2:3', but store->get('2:3') got error: Connection reset by peer [default6]:Exception raised from recvBytes at ../torch/csrc/distributed/c10d/Utils.hpp:670 (most recent call first): [default6]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f0d69d69d87 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libc10.so) [default6]:frame #1: + 0x589518e (0x7f0da1d2318e in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default6]:frame #2: c10d::TCPStore::doWait(c10::ArrayRef, std::chrono::duration >) + 0x360 (0x7f0da1d1d9a0 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default6]:frame #3: c10d::TCPStore::doGet(std::string const&) + 0x32 (0x7f0da1d1dce2 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default6]:frame #4: c10d::TCPStore::get(std::string const&) + 0xa1 (0x7f0da1d1eb11 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default6]:frame #5: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f0da1cd3f81 in 
/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default6]:frame #6: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f0da1cd3f81 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default6]:frame #7: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f0da1cd3f81 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default6]:frame #8: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f0da1cd3f81 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default6]:frame #9: c10d::ProcessGroupNCCL::broadcastUniqueNCCLID(ncclUniqueId*, bool, std::string const&, int) + 0xa9 (0x7f0d6af11c69 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default6]:frame #10: c10d::ProcessGroupNCCL::getNCCLComm(std::string const&, std::vector > const&, c10d::OpType, int, bool) + 0x22b (0x7f0d6af18c5b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default6]:frame #11: c10d::ProcessGroupNCCL::recv(std::vector >&, int, int) + 0x550 (0x7f0d6af3bb60 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) [default6]:frame #12: + 0x5838439 (0x7f0da1cc6439 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default6]:frame #13: + 0x5843330 (0x7f0da1cd1330 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default6]:frame #14: + 0x58433c5 (0x7f0da1cd13c5 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default6]:frame #15: + 0x4e893cc (0x7f0da13173cc in 
/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default6]:frame #16: + 0x1a08a88 (0x7f0d9de96a88 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default6]:frame #17: + 0x5849a84 (0x7f0da1cd7a84 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default6]:frame #18: + 0x584ed35 (0x7f0da1cdcd35 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so) [default6]:frame #19: + 0xc97eee (0x7f0db458eeee in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_python.so) [default6]:frame #20: + 0x413ea4 (0x7f0db3d0aea4 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/lib/libtorch_python.so) [default6]:frame #21: + 0x1445a6 (0x5597983465a6 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:frame #22: _PyObject_MakeTpCall + 0x26b (0x55979833fa6b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:frame #23: + 0x150866 (0x559798352866 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:frame #24: _PyEval_EvalFrameDefault + 0x4c12 (0x55979833b142 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:frame #25: _PyFunction_Vectorcall + 0x6c (0x559798346a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:frame #26: PyObject_Call + 0xbc (0x559798352f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:frame #27: _PyEval_EvalFrameDefault + 0x2d83 (0x5597983392b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:frame #28: _PyFunction_Vectorcall + 0x6c (0x559798346a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) 
[default6]:frame #29: _PyEval_EvalFrameDefault + 0x13ca (0x5597983378fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:frame #30: + 0x150582 (0x559798352582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:frame #31: _PyEval_EvalFrameDefault + 0x13ca (0x5597983378fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:frame #32: + 0x150582 (0x559798352582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:frame #33: _PyEval_EvalFrameDefault + 0x13ca (0x5597983378fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:frame #34: + 0x150582 (0x559798352582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:frame #35: _PyEval_EvalFrameDefault + 0x13ca (0x5597983378fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:frame #36: _PyObject_FastCallDictTstate + 0xd0 (0x55979833ef50 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:frame #37: _PyObject_Call_Prepend + 0x69 (0x559798350c39 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:frame #38: + 0x211239 (0x559798413239 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:frame #39: _PyObject_MakeTpCall + 0x26b (0x55979833fa6b in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:frame #40: _PyEval_EvalFrameDefault + 0x4eb6 (0x55979833b3e6 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:frame #41: _PyFunction_Vectorcall + 0x6c (0x559798346a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:frame #42: _PyEval_EvalFrameDefault + 0x72c (0x559798336c5c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:frame #43: _PyFunction_Vectorcall + 0x6c (0x559798346a2c in 
/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:frame #44: _PyEval_EvalFrameDefault + 0x13ca (0x5597983378fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:frame #45: + 0x150582 (0x559798352582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:frame #46: PyObject_Call + 0xbc (0x559798352f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:frame #47: _PyEval_EvalFrameDefault + 0x2d83 (0x5597983392b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:frame #48: + 0x150582 (0x559798352582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:frame #49: PyObject_Call + 0xbc (0x559798352f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:frame #50: _PyEval_EvalFrameDefault + 0x2d83 (0x5597983392b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:frame #51: _PyFunction_Vectorcall + 0x6c (0x559798346a2c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:frame #52: _PyObject_FastCallDictTstate + 0x187 (0x55979833f007 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:frame #53: _PyObject_Call_Prepend + 0x69 (0x559798350c39 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:frame #54: + 0x211239 (0x559798413239 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:frame #55: PyObject_Call + 0x207 (0x559798353067 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:frame #56: _PyEval_EvalFrameDefault + 0x2d83 (0x5597983392b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:frame #57: + 0x150582 (0x559798352582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:frame #58: _PyEval_EvalFrameDefault + 0x13ca 
(0x5597983378fa in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:frame #59: + 0x150582 (0x559798352582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:frame #60: PyObject_Call + 0xbc (0x559798352f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:frame #61: _PyEval_EvalFrameDefault + 0x2d83 (0x5597983392b3 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:frame #62: + 0x150582 (0x559798352582 in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:frame #63: PyObject_Call + 0xbc (0x559798352f1c in /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10) [default6]:. This may indicate a possible application crash on rank 0 or a network set up issue. [2024-07-06 09:37:43,625] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 1423317 closing signal SIGTERM [2024-07-06 09:37:43,626] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 1423318 closing signal SIGTERM [2024-07-06 09:37:43,626] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 1423319 closing signal SIGTERM [2024-07-06 09:37:43,629] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 3623615 closing signal SIGTERM [2024-07-06 09:37:43,629] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 3623616 closing signal SIGTERM [2024-07-06 09:37:43,630] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 289155 closing signal SIGTERM [2024-07-06 09:37:43,630] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 3623617 closing signal SIGTERM [2024-07-06 09:37:43,630] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 289156 closing signal SIGTERM [2024-07-06 09:37:43,630] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 3623618 closing signal SIGTERM 
[2024-07-06 09:37:43,630] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 289157 closing signal SIGTERM [2024-07-06 09:37:43,630] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 3623620 closing signal SIGTERM [2024-07-06 09:37:43,631] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 289158 closing signal SIGTERM [2024-07-06 09:37:43,630] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 3623621 closing signal SIGTERM [2024-07-06 09:37:43,631] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 289162 closing signal SIGTERM [2024-07-06 09:37:43,633] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 1710735 closing signal SIGTERM [2024-07-06 09:37:43,634] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 1710736 closing signal SIGTERM [2024-07-06 09:37:43,634] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 1710737 closing signal SIGTERM [2024-07-06 09:37:43,634] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 1710738 closing signal SIGTERM [2024-07-06 09:37:43,634] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 1710741 closing signal SIGTERM [2024-07-06 09:37:43,633] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 330051 closing signal SIGTERM [2024-07-06 09:37:43,634] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 330052 closing signal SIGTERM [2024-07-06 09:37:43,634] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 330053 closing signal SIGTERM [2024-07-06 09:37:43,634] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 330054 closing signal SIGTERM [2024-07-06 09:37:43,633] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 2033598 closing signal SIGTERM [2024-07-06 09:37:43,634] 
torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 330057 closing signal SIGTERM [2024-07-06 09:37:43,633] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 120571 closing signal SIGTERM [2024-07-06 09:37:43,633] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 2033599 closing signal SIGTERM [2024-07-06 09:37:43,633] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 120572 closing signal SIGTERM [2024-07-06 09:37:43,633] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 2033600 closing signal SIGTERM [2024-07-06 09:37:43,634] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 120573 closing signal SIGTERM [2024-07-06 09:37:43,633] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 2033601 closing signal SIGTERM [2024-07-06 09:37:43,634] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 120574 closing signal SIGTERM [2024-07-06 09:37:43,634] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 120578 closing signal SIGTERM [2024-07-06 09:37:43,633] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 2033602 closing signal SIGTERM [2024-07-06 09:37:43,639] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 647278 closing signal SIGTERM [2024-07-06 09:37:43,640] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 647279 closing signal SIGTERM [2024-07-06 09:37:43,640] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 647280 closing signal SIGTERM [2024-07-06 09:37:43,640] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 647281 closing signal SIGTERM [2024-07-06 09:37:43,640] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 647283 closing signal SIGTERM [2024-07-06 09:37:43,640] torch.distributed.elastic.multiprocessing.api: [WARNING] 
Sending process 647284 closing signal SIGTERM [2024-07-06 09:37:45,151] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 1) local_rank: 4 (pid: 1710739) of binary: /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10 [2024-07-06 09:37:45,153] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 1) local_rank: 5 (pid: 2033603) of binary: /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10 Traceback (most recent call last): File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/torchrun", line 8, in sys.exit(main()) File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 347, in wrapper return f(*args, **kwargs) File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/run.py", line 812, in main run(args) File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/run.py", line 803, in run elastic_launch( File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 135, in __call__ Traceback (most recent call last): File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/torchrun", line 8, in return launch_agent(self._config, self._entrypoint, list(args)) File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 268, in launch_agent sys.exit(main()) File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 347, in wrapper raise ChildFailedError( torch.distributed.elastic.multiprocessing.errors.ChildFailedError: ============================================================ /fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py FAILED 
------------------------------------------------------------ Failures: [1]: time : 2024-07-06_09:37:43 host : ip-26-0-174-36.ec2.internal rank : 61 (local_rank: 5) exitcode : 1 (pid: 1710740) error_file: traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html [2]: time : 2024-07-06_09:37:43 host : ip-26-0-174-36.ec2.internal rank : 63 (local_rank: 7) exitcode : 1 (pid: 1710742) error_file: traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html ------------------------------------------------------------ Root Cause (first observed failure): [0]: time : 2024-07-06_09:37:43 host : ip-26-0-174-36.ec2.internal rank : 60 (local_rank: 4) exitcode : 1 (pid: 1710739) error_file: traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html ============================================================ return f(*args, **kwargs) File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/run.py", line 812, in main run(args) File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/run.py", line 803, in run elastic_launch( File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 135, in __call__ return launch_agent(self._config, self._entrypoint, list(args)) File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 268, in launch_agent raise ChildFailedError( torch.distributed.elastic.multiprocessing.errors.ChildFailedError: ============================================================ /fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py FAILED ------------------------------------------------------------ Failures: [1]: time : 2024-07-06_09:37:43 host : ip-26-0-163-147.ec2.internal rank : 30 (local_rank: 6) exitcode : 1 (pid: 2033604) error_file: 
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html [2]: time : 2024-07-06_09:37:43 host : ip-26-0-163-147.ec2.internal rank : 31 (local_rank: 7) exitcode : 1 (pid: 2033605) error_file: traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html ------------------------------------------------------------ Root Cause (first observed failure): [0]: time : 2024-07-06_09:37:43 host : ip-26-0-163-147.ec2.internal rank : 29 (local_rank: 5) exitcode : 1 (pid: 2033603) error_file: traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html ============================================================ [2024-07-06 09:37:45,244] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 1) local_rank: 0 (pid: 1423316) of binary: /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10 [2024-07-06 09:37:45,254] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 1) local_rank: 4 (pid: 289159) of binary: /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10 Traceback (most recent call last): File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/torchrun", line 8, in sys.exit(main()) File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 347, in wrapper return f(*args, **kwargs) File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/run.py", line 812, in main run(args) File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/run.py", line 803, in run elastic_launch( File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 135, in __call__ return launch_agent(self._config, self._entrypoint, list(args)) File 
"/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 268, in launch_agent Traceback (most recent call last): File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/torchrun", line 8, in raise ChildFailedError( torch.distributed.elastic.multiprocessing.errors.ChildFailedError: ============================================================ /fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py FAILED ------------------------------------------------------------ Failures: [1]: time : 2024-07-06_09:37:43 host : ip-26-0-161-138.ec2.internal rank : 4 (local_rank: 4) exitcode : 1 (pid: 1423320) error_file: traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html [2]: time : 2024-07-06_09:37:43 host : ip-26-0-161-138.ec2.internal rank : 5 (local_rank: 5) exitcode : 1 (pid: 1423321) error_file: traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html [3]: time : 2024-07-06_09:37:43 host : ip-26-0-161-138.ec2.internal rank : 6 (local_rank: 6) exitcode : 1 (pid: 1423322) error_file: traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html [4]: time : 2024-07-06_09:37:43 host : ip-26-0-161-138.ec2.internal rank : 7 (local_rank: 7) exitcode : 1 (pid: 1423323) error_file: traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html ------------------------------------------------------------ Root Cause (first observed failure): [0]: time : 2024-07-06_09:37:43 host : ip-26-0-161-138.ec2.internal rank : 0 (local_rank: 0) exitcode : 1 (pid: 1423316) error_file: traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html ============================================================ sys.exit(main()) File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", 
line 347, in wrapper return f(*args, **kwargs) File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/run.py", line 812, in main run(args) File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/run.py", line 803, in run elastic_launch( File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 135, in __call__ return launch_agent(self._config, self._entrypoint, list(args)) File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 268, in launch_agent raise ChildFailedError( torch.distributed.elastic.multiprocessing.errors.ChildFailedError: ============================================================ /fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py FAILED ------------------------------------------------------------ Failures: [1]: time : 2024-07-06_09:37:43 host : ip-26-0-173-246.ec2.internal rank : 53 (local_rank: 5) exitcode : 1 (pid: 289160) error_file: traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html [2]: time : 2024-07-06_09:37:43 host : ip-26-0-173-246.ec2.internal rank : 54 (local_rank: 6) exitcode : 1 (pid: 289161) error_file: traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html ------------------------------------------------------------ Root Cause (first observed failure): [0]: time : 2024-07-06_09:37:43 host : ip-26-0-173-246.ec2.internal rank : 52 (local_rank: 4) exitcode : 1 (pid: 289159) error_file: traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html ============================================================ [2024-07-06 09:37:45,347] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 1) local_rank: 4 (pid: 330055) of binary: 
/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10 [2024-07-06 09:37:45,388] torch.distributed.elastic.rendezvous.dynamic_rendezvous: [WARNING] The node 'ip-26-0-165-131.ec2.internal_329981_0' has failed to shutdown the rendezvous 'none' due to an error of type RendezvousConnectionError. Traceback (most recent call last): File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/torchrun", line 8, in sys.exit(main()) File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 347, in wrapper return f(*args, **kwargs) File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/run.py", line 812, in main run(args) File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/run.py", line 803, in run elastic_launch( File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 135, in __call__ return launch_agent(self._config, self._entrypoint, list(args)) File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 268, in launch_agent raise ChildFailedError( torch.distributed.elastic.multiprocessing.errors.ChildFailedError: ============================================================ /fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py FAILED ------------------------------------------------------------ Failures: [1]: time : 2024-07-06_09:37:43 host : ip-26-0-165-131.ec2.internal rank : 37 (local_rank: 5) exitcode : 1 (pid: 330056) error_file: traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html [2]: time : 2024-07-06_09:37:43 host : ip-26-0-165-131.ec2.internal rank : 39 (local_rank: 7) exitcode : 1 (pid: 330058) error_file: traceback : To enable traceback see: 
https://pytorch.org/docs/stable/elastic/errors.html ------------------------------------------------------------ Root Cause (first observed failure): [0]: time : 2024-07-06_09:37:43 host : ip-26-0-165-131.ec2.internal rank : 36 (local_rank: 4) exitcode : 1 (pid: 330055) error_file: traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html ============================================================ srun: error: ip-26-0-163-147: task 3: Exited with exit code 1 [2024-07-06 09:37:45,445] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 1) local_rank: 4 (pid: 120575) of binary: /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10 srun: error: ip-26-0-174-36: task 7: Exited with exit code 1 [2024-07-06 09:37:45,489] torch.distributed.elastic.rendezvous.dynamic_rendezvous: [WARNING] The node 'ip-26-0-165-59.ec2.internal_120502_0' has failed to shutdown the rendezvous 'none' due to an error of type RendezvousConnectionError. Traceback (most recent call last): File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/torchrun", line 8, in sys.exit(main()) File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 347, in wrapper return f(*args, **kwargs) File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/run.py", line 812, in main run(args) File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/run.py", line 803, in run elastic_launch( File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 135, in __call__ return launch_agent(self._config, self._entrypoint, list(args)) File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 268, in launch_agent raise 
ChildFailedError( torch.distributed.elastic.multiprocessing.errors.ChildFailedError: ============================================================ /fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py FAILED ------------------------------------------------------------ Failures: [1]: time : 2024-07-06_09:37:43 host : ip-26-0-165-59.ec2.internal rank : 45 (local_rank: 5) exitcode : 1 (pid: 120576) error_file: traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html [2]: time : 2024-07-06_09:37:43 host : ip-26-0-165-59.ec2.internal rank : 46 (local_rank: 6) exitcode : 1 (pid: 120577) error_file: traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html ------------------------------------------------------------ Root Cause (first observed failure): [0]: time : 2024-07-06_09:37:43 host : ip-26-0-165-59.ec2.internal rank : 44 (local_rank: 4) exitcode : 1 (pid: 120575) error_file: traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html ============================================================ srun: error: ip-26-0-173-246: task 6: Exited with exit code 1 [2024-07-06 09:37:45,557] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 1) local_rank: 4 (pid: 647282) of binary: /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10 [2024-07-06 09:37:45,597] torch.distributed.elastic.rendezvous.dynamic_rendezvous: [WARNING] The node 'ip-26-0-161-142.ec2.internal_647209_0' has failed to shutdown the rendezvous 'none' due to an error of type RendezvousConnectionError. 
Traceback (most recent call last): File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/torchrun", line 8, in sys.exit(main()) File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 347, in wrapper return f(*args, **kwargs) File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/run.py", line 812, in main run(args) File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/run.py", line 803, in run elastic_launch( File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 135, in __call__ return launch_agent(self._config, self._entrypoint, list(args)) File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 268, in launch_agent raise ChildFailedError( torch.distributed.elastic.multiprocessing.errors.ChildFailedError: ============================================================ /fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py FAILED ------------------------------------------------------------ srun: error: ip-26-0-161-138: task 0: Exited with exit code 1 Failures: ------------------------------------------------------------ Root Cause (first observed failure): [0]: time : 2024-07-06_09:37:43 host : ip-26-0-161-142.ec2.internal rank : 12 (local_rank: 4) exitcode : 1 (pid: 647282) error_file: traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html ============================================================ srun: error: ip-26-0-165-131: task 5: Exited with exit code 1 srun: error: ip-26-0-165-59: task 4: Exited with exit code 1 [2024-07-06 09:37:45,748] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 1) local_rank: 4 (pid: 3623619) of binary: 
/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10 [2024-07-06 09:37:45,794] torch.distributed.elastic.rendezvous.dynamic_rendezvous: [WARNING] The node 'ip-26-0-163-134.ec2.internal_3623546_0' has failed to shutdown the rendezvous 'none' due to an error of type RendezvousConnectionError. Traceback (most recent call last): File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/torchrun", line 8, in sys.exit(main()) File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 347, in wrapper return f(*args, **kwargs) File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/run.py", line 812, in main run(args) File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/run.py", line 803, in run elastic_launch( File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 135, in __call__ return launch_agent(self._config, self._entrypoint, list(args)) File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 268, in launch_agent raise ChildFailedError( torch.distributed.elastic.multiprocessing.errors.ChildFailedError: ============================================================ /fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py FAILED ------------------------------------------------------------ Failures: [1]: time : 2024-07-06_09:37:43 host : ip-26-0-163-134.ec2.internal rank : 23 (local_rank: 7) exitcode : 1 (pid: 3623622) error_file: traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html ------------------------------------------------------------ Root Cause (first observed failure): [0]: time : 2024-07-06_09:37:43 host : ip-26-0-163-134.ec2.internal rank : 20 (local_rank: 4) 
exitcode : 1 (pid: 3623619)
error_file:
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
srun: error: ip-26-0-161-142: task 1: Exited with exit code 1
srun: error: ip-26-0-163-134: task 2: Exited with exit code 1
Consider using `hf_transfer` for faster uploads. This solution comes with some limitations. See https://huggingface.co/docs/huggingface_hub/hf_transfer for more details.