👋 Hello @xiaoxiaodecheng, thank you for reporting this issue with YOLOv5 and DeepSpeed integration 🚀! This is an automated response to help get your issue addressed as quickly as possible. An Ultralytics engineer will review your report and assist you soon.
To better assist you, could you please provide a minimum reproducible example (MRE) that demonstrates the error? This helps us replicate the issue and provide a more accurate solution.
In the meantime, please review the following resources to ensure your setup aligns with our recommendations:
YOLOv5 can be run in any of the following verified, up-to-date environments (with all dependencies including CUDA/CUDNN, Python and PyTorch preinstalled): free-GPU notebooks (Colab, Kaggle), Google Cloud Deep Learning VM, Amazon Deep Learning AMI, or the official Docker image.
If the YOLOv5 CI badge is green, all YOLOv5 GitHub Actions Continuous Integration (CI) tests are currently passing. CI tests verify correct operation of YOLOv5 training, validation, inference, export and benchmarks on macOS, Windows, and Ubuntu every 24 hours and on every commit.
This error occurs because the DeepSpeed-wrapped model contains non-picklable objects (specifically the ProcessGroup) which can't be deep-copied when creating the ModelEMA.
To fix this, you need to modify your code to extract the underlying model before creating the EMA:
```python
# Instead of this:
ema = ModelEMA(model) if RANK in {-1, 0} else None

# Use this:
if RANK in {-1, 0}:
    # Get the underlying model from DeepSpeed's engine
    unwrapped_model = model.module if hasattr(model, "module") else model
    ema = ModelEMA(unwrapped_model)
else:
    ema = None
```
This extracts the base model without the non-picklable DeepSpeed components, allowing ModelEMA to work correctly. You may also need similar adjustments anywhere else the code deep-copies the model, as sketched below.
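For example, YOLOv5's checkpoint saving also calls deepcopy(de_parallel(model)), and de_parallel only unwraps DataParallel/DistributedDataParallel, so the wrapped DeepSpeed engine would hit the same pickling error there. Below is a minimal sketch of a reusable helper for those spots; the unwrap_model name is illustrative (not part of YOLOv5), and it relies on the fact that DeepSpeedEngine, like DDP, exposes the wrapped network via its .module attribute:

```python
import torch.nn as nn


def unwrap_model(model: nn.Module) -> nn.Module:
    """Return the bare nn.Module underneath DeepSpeed/DDP/DataParallel wrappers.

    DeepSpeedEngine, DistributedDataParallel and DataParallel all keep the
    original network in a .module attribute, so peel those layers off until
    none remain.
    """
    while hasattr(model, "module") and isinstance(model.module, nn.Module):
        model = model.module
    return model


# Usage anywhere YOLOv5 deep-copies the model, e.g. when building the EMA
# or assembling the checkpoint dict (names follow train.py):
# ema = ModelEMA(unwrap_model(model))
# ckpt["model"] = deepcopy(unwrap_model(model)).half()
```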
Search before asking

I have searched the YOLOv5 issues and discussions and found no similar questions.

Question
I am using DeepSpeed to accelerate YOLOv5 training and have added the corresponding steps in the training file. Below is the code I added to the training file, along with my DeepSpeed configuration file:
```python
import deepspeed

model, optimizer, _, _ = deepspeed.initialize(
    args=opt,
    model=model,
    optimizer=optimizer,
    model_parameters=model.parameters(),
    config=opt.deepspeed_config_file,
)
```
ds_config.json:

```json
{
    "train_batch_size": 16,
    "gradient_accumulation_steps": 1,
    "zero_optimization": {
        "stage": 0
    },
    "zero_allow_untested_optimizer": true
}
```
However, I encountered the following error during execution and am not sure how to resolve it:
ep_module_on_host=False replace_with_kernel_inject=False
[2025-04-24 20:06:55,076] [INFO] [config.py:1007:print] timers_config ................ enabled=True synchronized=True
[2025-04-24 20:06:55,076] [INFO] [config.py:1007:print] train_batch_size ............. 16
[2025-04-24 20:06:55,076] [INFO] [config.py:1007:print] train_micro_batch_size_per_gpu 16
[2025-04-24 20:06:55,076] [INFO] [config.py:1007:print] use_data_before_expert_parallel_ False
[2025-04-24 20:06:55,076] [INFO] [config.py:1007:print] use_node_local_storage ....... False
[2025-04-24 20:06:55,076] [INFO] [config.py:1007:print] wall_clock_breakdown ......... False
[2025-04-24 20:06:55,076] [INFO] [config.py:1007:print] weight_quantization_config ... None
[2025-04-24 20:06:55,076] [INFO] [config.py:1007:print] world_size ................... 1
[2025-04-24 20:06:55,076] [INFO] [config.py:1007:print] zero_allow_untested_optimizer True
[2025-04-24 20:06:55,076] [INFO] [config.py:1007:print] zero_config .................. stage=0 contiguous_gradients=True reduce_scatter=True reduce_bucket_size=500000000 use_multi_rank_bucket_allreduce=True allgather_partitions=True allgather_bucket_size=500000000 overlap_comm=False load_from_fp32_weights=True elastic_checkpoint=False offload_param=None offload_optimizer=None sub_group_size=1000000000 cpu_offload_param=None cpu_offload_use_pin_memory=None cpu_offload=None prefetch_bucket_size=50000000 param_persistence_threshold=100000 model_persistence_threshold=9223372036854775807 max_live_parameters=1000000000 max_reuse_distance=1000000000 gather_16bit_weights_on_model_save=False module_granularity_threshold=0 use_all_reduce_for_fetch_params=False stage3_gather_fp16_weights_on_model_save=False ignore_unused_parameters=True legacy_stage1=False round_robin_gradients=False zero_hpz_partition_size=1 zero_quantized_weights=False zero_quantized_nontrainable_weights=False zero_quantized_gradients=False zeropp_loco_param=None mics_shard_size=-1 mics_hierarchical_params_gather=False memory_efficient_linear=True pipeline_loading_checkpoint=False override_module_apply=True log_trace_cache_warnings=False
[2025-04-24 20:06:55,076] [INFO] [config.py:1007:print] zero_enabled ................. False
[2025-04-24 20:06:55,076] [INFO] [config.py:1007:print] zero_force_ds_cpu_optimizer .. True
[2025-04-24 20:06:55,076] [INFO] [config.py:1007:print] zero_optimization_stage ...... 0
[2025-04-24 20:06:55,076] [INFO] [config.py:993:print_user_config] json = {
"train_batch_size": 16,
"gradient_accumulation_steps": 1,
"zero_optimization": {
"stage": 0
},
"zero_allow_untested_optimizer": true
}
Traceback (most recent call last):
File "/home/admslc/code/yolov5/train.py", line 986, in
main(opt)
File "/home/admslc/code/yolov5/train.py", line 854, in main
train(opt.hyp, opt, device, callbacks)
File "/home/admslc/code/yolov5/train.py", line 260, in train
ema = ModelEMA(model) if RANK in {-1, 0} else None
File "/home/admslc/code/yolov5/utils/torch_utils.py", line 412, in init
self.ema = deepcopy(de_parallel(model)).eval() # FP32 EMA
File "/home/admslc/anaconda3/envs/env-py310/lib/python3.10/copy.py", line 172, in deepcopy
y = _reconstruct(x, memo, *rv)
File "/home/admslc/anaconda3/envs/env-py310/lib/python3.10/copy.py", line 271, in _reconstruct
state = deepcopy(state, memo)
File "/home/admslc/anaconda3/envs/env-py310/lib/python3.10/copy.py", line 146, in deepcopy
y = copier(x, memo)
File "/home/admslc/anaconda3/envs/env-py310/lib/python3.10/copy.py", line 231, in _deepcopy_dict
y[deepcopy(key, memo)] = deepcopy(value, memo)
File "/home/admslc/anaconda3/envs/env-py310/lib/python3.10/copy.py", line 161, in deepcopy
rv = reductor(4)
TypeError: cannot pickle 'torch._C._distributed_c10d.ProcessGroup' object
[2025-04-24 20:06:54,411] [INFO] [launch.py:319:sigkill_handler] Killing subprocess 149527
Additional
No response