[BUG]TypeError: cannot pickle 'torch._C._distributed_c10d.ProcessGroup' object #13574


Open

xiaoxiaodecheng opened this issue Apr 24, 2025 · 2 comments
Labels
detect (Object Detection issues, PR's) · question (Further information is requested)

Comments

@xiaoxiaodecheng

Search before asking

Question

I am using DeepSpeed to accelerate YOLOv5 training and have added the corresponding steps to the training file.
Below is the code I added to the training file, followed by my DeepSpeed configuration file.

model, optimizer, _, _ = deepspeed.initialize(
    args=opt,
    model=model,
    optimizer=optimizer,
    model_parameters=model.parameters(),
    config=opt.deepspeed_config_file,
)

ds_config.json

{ "train_batch_size": 16, "gradient_accumulation_steps": 1, "zero_optimization": { "stage": 0 }, "zero_allow_untested_optimizer": true }

However, I encountered the following error during execution and am not sure how to resolve it:

ep_module_on_host=False replace_with_kernel_inject=False
[2025-04-24 20:06:55,076] [INFO] [config.py:1007:print] timers_config ................ enabled=True synchronized=True
[2025-04-24 20:06:55,076] [INFO] [config.py:1007:print] train_batch_size ............. 16
[2025-04-24 20:06:55,076] [INFO] [config.py:1007:print] train_micro_batch_size_per_gpu 16
[2025-04-24 20:06:55,076] [INFO] [config.py:1007:print] use_data_before_expert_parallel_ False
[2025-04-24 20:06:55,076] [INFO] [config.py:1007:print] use_node_local_storage ....... False
[2025-04-24 20:06:55,076] [INFO] [config.py:1007:print] wall_clock_breakdown ......... False
[2025-04-24 20:06:55,076] [INFO] [config.py:1007:print] weight_quantization_config ... None
[2025-04-24 20:06:55,076] [INFO] [config.py:1007:print] world_size ................... 1
[2025-04-24 20:06:55,076] [INFO] [config.py:1007:print] zero_allow_untested_optimizer True
[2025-04-24 20:06:55,076] [INFO] [config.py:1007:print] zero_config .................. stage=0 contiguous_gradients=True reduce_scatter=True reduce_bucket_size=500000000 use_multi_rank_bucket_allreduce=True allgather_partitions=True allgather_bucket_size=500000000 overlap_comm=False load_from_fp32_weights=True elastic_checkpoint=False offload_param=None offload_optimizer=None sub_group_size=1000000000 cpu_offload_param=None cpu_offload_use_pin_memory=None cpu_offload=None prefetch_bucket_size=50000000 param_persistence_threshold=100000 model_persistence_threshold=9223372036854775807 max_live_parameters=1000000000 max_reuse_distance=1000000000 gather_16bit_weights_on_model_save=False module_granularity_threshold=0 use_all_reduce_for_fetch_params=False stage3_gather_fp16_weights_on_model_save=False ignore_unused_parameters=True legacy_stage1=False round_robin_gradients=False zero_hpz_partition_size=1 zero_quantized_weights=False zero_quantized_nontrainable_weights=False zero_quantized_gradients=False zeropp_loco_param=None mics_shard_size=-1 mics_hierarchical_params_gather=False memory_efficient_linear=True pipeline_loading_checkpoint=False override_module_apply=True log_trace_cache_warnings=False
[2025-04-24 20:06:55,076] [INFO] [config.py:1007:print] zero_enabled ................. False
[2025-04-24 20:06:55,076] [INFO] [config.py:1007:print] zero_force_ds_cpu_optimizer .. True
[2025-04-24 20:06:55,076] [INFO] [config.py:1007:print] zero_optimization_stage ...... 0
[2025-04-24 20:06:55,076] [INFO] [config.py:993:print_user_config] json = {
"train_batch_size": 16,
"gradient_accumulation_steps": 1,
"zero_optimization": {
"stage": 0
},
"zero_allow_untested_optimizer": true
}
Traceback (most recent call last):
File "/home/admslc/code/yolov5/train.py", line 986, in
main(opt)
File "/home/admslc/code/yolov5/train.py", line 854, in main
train(opt.hyp, opt, device, callbacks)
File "/home/admslc/code/yolov5/train.py", line 260, in train
ema = ModelEMA(model) if RANK in {-1, 0} else None
File "/home/admslc/code/yolov5/utils/torch_utils.py", line 412, in init
self.ema = deepcopy(de_parallel(model)).eval() # FP32 EMA
File "/home/admslc/anaconda3/envs/env-py310/lib/python3.10/copy.py", line 172, in deepcopy
y = _reconstruct(x, memo, *rv)
File "/home/admslc/anaconda3/envs/env-py310/lib/python3.10/copy.py", line 271, in _reconstruct
state = deepcopy(state, memo)
File "/home/admslc/anaconda3/envs/env-py310/lib/python3.10/copy.py", line 146, in deepcopy
y = copier(x, memo)
File "/home/admslc/anaconda3/envs/env-py310/lib/python3.10/copy.py", line 231, in _deepcopy_dict
y[deepcopy(key, memo)] = deepcopy(value, memo)
File "/home/admslc/anaconda3/envs/env-py310/lib/python3.10/copy.py", line 161, in deepcopy
rv = reductor(4)
TypeError: cannot pickle 'torch._C._distributed_c10d.ProcessGroup' object
[2025-04-24 20:06:54,411] [INFO] [launch.py:319:sigkill_handler] Killing subprocess 149527

Additional

No response

@xiaoxiaodecheng xiaoxiaodecheng added the question Further information is requested label Apr 24, 2025
@UltralyticsAssistant UltralyticsAssistant added the detect Object Detection issues, PR's label Apr 24, 2025
@UltralyticsAssistant
Member

👋 Hello @xiaoxiaodecheng, thank you for reporting this issue with YOLOv5 and DeepSpeed integration 🚀! This is an automated response to help get your issue addressed as quickly as possible. An Ultralytics engineer will review your report and assist you soon.

To better assist you, could you please provide a minimum reproducible example (MRE) that demonstrates the error? This helps us replicate the issue and provide a more accurate solution.

In the meantime, please review the following resources to ensure your setup aligns with our recommendations:

Requirements

Python>=3.8.0 with all requirements.txt installed including PyTorch>=1.8. To get started:

git clone https://github.com/ultralytics/yolov5  # clone
cd yolov5
pip install -r requirements.txt  # install

Environments

YOLOv5 can be run in any of the following verified, up-to-date environments (with all dependencies including CUDA/CUDNN, Python and PyTorch preinstalled):

Status

YOLOv5 CI

If this badge is green, all YOLOv5 GitHub Actions Continuous Integration (CI) tests are currently passing. CI tests verify correct operation of YOLOv5 training, validation, inference, export and benchmarks on macOS, Windows, and Ubuntu every 24 hours and on every commit.

If you have not already, please ensure you are following our Tips for Best Training Results.

Thank you for your patience and for helping improve YOLOv5! 🛠️

@pderrenger
Member

Hi @xiaoxiaodecheng,

This error occurs because the DeepSpeed-wrapped model contains non-picklable objects (specifically the ProcessGroup) which can't be deep-copied when creating the ModelEMA.
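
For reference, here is a minimal standalone sketch of the failure mode (a single-process "gloo" setup chosen purely for illustration; it is not part of YOLOv5 or DeepSpeed):

import copy
import os

import torch.distributed as dist

# Single-process process group so the example runs anywhere (no GPUs needed).
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29500")
dist.init_process_group(backend="gloo", rank=0, world_size=1)

pg = dist.new_group()  # a torch._C._distributed_c10d.ProcessGroup instance
try:
    copy.deepcopy(pg)  # any object referencing a ProcessGroup fails the same way
except TypeError as err:
    print(err)  # cannot pickle 'torch._C._distributed_c10d.ProcessGroup' object
dist.destroy_process_group()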

To fix this, you need to modify your code to extract the underlying model before creating the EMA:

# Instead of this:
ema = ModelEMA(model) if RANK in {-1, 0} else None

# Use this:
if RANK in {-1, 0}:
    # Get the underlying model from DeepSpeed's engine
    unwrapped_model = model.module if hasattr(model, "module") else model
    ema = ModelEMA(unwrapped_model)
else:
    ema = None

This extracts the base model without the non-picklable DeepSpeed components, allowing ModelEMA to work correctly. You might also need similar adjustments in other parts of the code where model deep copying happens.
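
As an illustration, those other call sites can be handled with a small helper that strips every ".module" wrapper (DataParallel, DistributedDataParallel, or the DeepSpeed engine) before any deep copy; note that this unwrap_model helper is a hypothetical sketch, not existing YOLOv5 code:

import torch.nn as nn

def unwrap_model(model: nn.Module) -> nn.Module:
    """Return the innermost bare nn.Module, stripping any wrapper that exposes .module."""
    while hasattr(model, "module") and isinstance(model.module, nn.Module):
        model = model.module
    return model

# Example use at the EMA call site shown above:
# ema = ModelEMA(unwrap_model(model)) if RANK in {-1, 0} else None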
