[BUG]TypeError: cannot pickle 'torch._C._distributed_c10d.ProcessGroup' object #13574


Open

xiaoxiaodecheng opened this issue Apr 24, 2025 · 2 comments
Labels
detect (Object Detection issues, PR's) · question (Further information is requested)

Comments

@xiaoxiaodecheng

Search before asking

Question

I am using DeepSpeed to accelerate YOLOv5 training and have added the corresponding steps to the training file.
Below is the code I added to the training file, followed by my DeepSpeed configuration file.

model, optimizer, _, _ = deepspeed.initialize(
    args=opt,
    model=model,
    optimizer=optimizer,
    model_parameters=model.parameters(),
    config=opt.deepspeed_config_file,
)

ds_config.json

{ "train_batch_size": 16, "gradient_accumulation_steps": 1, "zero_optimization": { "stage": 0 }, "zero_allow_untested_optimizer": true }

However, I encountered the following error during execution and am not sure how to resolve it:

ep_module_on_host=False replace_with_kernel_inject=False
[2025-04-24 20:06:55,076] [INFO] [config.py:1007:print] timers_config ................ enabled=True synchronized=True
[2025-04-24 20:06:55,076] [INFO] [config.py:1007:print] train_batch_size ............. 16
[2025-04-24 20:06:55,076] [INFO] [config.py:1007:print] train_micro_batch_size_per_gpu 16
[2025-04-24 20:06:55,076] [INFO] [config.py:1007:print] use_data_before_expert_parallel_ False
[2025-04-24 20:06:55,076] [INFO] [config.py:1007:print] use_node_local_storage ....... False
[2025-04-24 20:06:55,076] [INFO] [config.py:1007:print] wall_clock_breakdown ......... False
[2025-04-24 20:06:55,076] [INFO] [config.py:1007:print] weight_quantization_config ... None
[2025-04-24 20:06:55,076] [INFO] [config.py:1007:print] world_size ................... 1
[2025-04-24 20:06:55,076] [INFO] [config.py:1007:print] zero_allow_untested_optimizer True
[2025-04-24 20:06:55,076] [INFO] [config.py:1007:print] zero_config .................. stage=0 contiguous_gradients=True reduce_scatter=True reduce_bucket_size=500000000 use_multi_rank_bucket_allreduce=True allgather_partitions=True allgather_bucket_size=500000000 overlap_comm=False load_from_fp32_weights=True elastic_checkpoint=False offload_param=None offload_optimizer=None sub_group_size=1000000000 cpu_offload_param=None cpu_offload_use_pin_memory=None cpu_offload=None prefetch_bucket_size=50000000 param_persistence_threshold=100000 model_persistence_threshold=9223372036854775807 max_live_parameters=1000000000 max_reuse_distance=1000000000 gather_16bit_weights_on_model_save=False module_granularity_threshold=0 use_all_reduce_for_fetch_params=False stage3_gather_fp16_weights_on_model_save=False ignore_unused_parameters=True legacy_stage1=False round_robin_gradients=False zero_hpz_partition_size=1 zero_quantized_weights=False zero_quantized_nontrainable_weights=False zero_quantized_gradients=False zeropp_loco_param=None mics_shard_size=-1 mics_hierarchical_params_gather=False memory_efficient_linear=True pipeline_loading_checkpoint=False override_module_apply=True log_trace_cache_warnings=False
[2025-04-24 20:06:55,076] [INFO] [config.py:1007:print] zero_enabled ................. False
[2025-04-24 20:06:55,076] [INFO] [config.py:1007:print] zero_force_ds_cpu_optimizer .. True
[2025-04-24 20:06:55,076] [INFO] [config.py:1007:print] zero_optimization_stage ...... 0
[2025-04-24 20:06:55,076] [INFO] [config.py:993:print_user_config] json = {
"train_batch_size": 16,
"gradient_accumulation_steps": 1,
"zero_optimization": {
"stage": 0
},
"zero_allow_untested_optimizer": true
}
Traceback (most recent call last):
File "/home/admslc/code/yolov5/train.py", line 986, in
main(opt)
File "/home/admslc/code/yolov5/train.py", line 854, in main
train(opt.hyp, opt, device, callbacks)
File "/home/admslc/code/yolov5/train.py", line 260, in train
ema = ModelEMA(model) if RANK in {-1, 0} else None
File "/home/admslc/code/yolov5/utils/torch_utils.py", line 412, in init
self.ema = deepcopy(de_parallel(model)).eval() # FP32 EMA
File "/home/admslc/anaconda3/envs/env-py310/lib/python3.10/copy.py", line 172, in deepcopy
y = _reconstruct(x, memo, *rv)
File "/home/admslc/anaconda3/envs/env-py310/lib/python3.10/copy.py", line 271, in _reconstruct
state = deepcopy(state, memo)
File "/home/admslc/anaconda3/envs/env-py310/lib/python3.10/copy.py", line 146, in deepcopy
y = copier(x, memo)
File "/home/admslc/anaconda3/envs/env-py310/lib/python3.10/copy.py", line 231, in _deepcopy_dict
y[deepcopy(key, memo)] = deepcopy(value, memo)
File "/home/admslc/anaconda3/envs/env-py310/lib/python3.10/copy.py", line 161, in deepcopy
rv = reductor(4)
TypeError: cannot pickle 'torch._C._distributed_c10d.ProcessGroup' object
[2025-04-24 20:06:54,411] [INFO] [launch.py:319:sigkill_handler] Killing subprocess 149527

Additional

No response

@xiaoxiaodecheng xiaoxiaodecheng added the question Further information is requested label Apr 24, 2025
@UltralyticsAssistant UltralyticsAssistant added the detect Object Detection issues, PR's label Apr 24, 2025
@UltralyticsAssistant
Member

👋 Hello @xiaoxiaodecheng, thank you for reporting this issue with YOLOv5 and DeepSpeed integration 🚀! This is an automated response to help get your issue addressed as quickly as possible. An Ultralytics engineer will review your report and assist you soon.

To better assist you, could you please provide a minimum reproducible example (MRE) that demonstrates the error? This helps us replicate the issue and provide a more accurate solution.

In the meantime, please review the following resources to ensure your setup aligns with our recommendations:

Requirements

Python>=3.8.0 with all requirements.txt installed including PyTorch>=1.8. To get started:

git clone https://github.com/ultralytics/yolov5  # clone
cd yolov5
pip install -r requirements.txt  # install

Environments

YOLOv5 can be run in any of the following verified, up-to-date environments (with all dependencies including CUDA/CUDNN, Python and PyTorch preinstalled):

Status

YOLOv5 CI

If this badge is green, all YOLOv5 GitHub Actions Continuous Integration (CI) tests are currently passing. CI tests verify correct operation of YOLOv5 training, validation, inference, export and benchmarks on macOS, Windows, and Ubuntu every 24 hours and on every commit.

If you have not already, please ensure you are following our Tips for Best Training Results.

Thank you for your patience and for helping improve YOLOv5! 🛠️

@pderrenger
Member

Hi @xiaoxiaodecheng,

This error occurs because the DeepSpeed-wrapped model contains non-picklable objects (specifically the ProcessGroup) which can't be deep-copied when creating the ModelEMA.
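
For reference, here is a minimal standalone sketch of the failure mode (a single-process "gloo" setup chosen purely for illustration; it is not part of YOLOv5 or DeepSpeed):

import copy
import os

import torch.distributed as dist

# Single-process process group so the example runs anywhere (no GPUs needed).
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29500")
dist.init_process_group(backend="gloo", rank=0, world_size=1)

pg = dist.new_group()  # a torch._C._distributed_c10d.ProcessGroup instance
try:
    copy.deepcopy(pg)  # any object referencing a ProcessGroup fails the same way
except TypeError as err:
    print(err)  # cannot pickle 'torch._C._distributed_c10d.ProcessGroup' object
dist.destroy_process_group()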

To fix this, you need to modify your code to extract the underlying model before creating the EMA:

# Instead of this:
ema = ModelEMA(model) if RANK in {-1, 0} else None

# Use this:
if RANK in {-1, 0}:
    # Get the underlying model from DeepSpeed's engine
    unwrapped_model = model.module if hasattr(model, "module") else model
    ema = ModelEMA(unwrapped_model)
else:
    ema = None

This extracts the base model without the non-picklable DeepSpeed components, allowing ModelEMA to work correctly. You might also need similar adjustments in other parts of the code where model deep copying happens.
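
As an illustration, those other call sites can be handled with a small helper that strips every ".module" wrapper (DataParallel, DistributedDataParallel, or the DeepSpeed engine) before any deep copy; note that this unwrap_model helper is a hypothetical sketch, not existing YOLOv5 code:

import torch.nn as nn

def unwrap_model(model: nn.Module) -> nn.Module:
    """Return the innermost bare nn.Module, stripping any wrapper that exposes .module."""
    while hasattr(model, "module") and isinstance(model.module, nn.Module):
        model = model.module
    return model

# Example use at the EMA call site shown above:
# ema = ModelEMA(unwrap_model(model)) if RANK in {-1, 0} else None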
