
Add _rank_not_in_group to idist #3339


Merged
merged 18 commits into pytorch:master from Feature-add-rank-not-in-group on Mar 12, 2025

Conversation

sadra-barikbin
Collaborator

Hi there!

This PR adds _rank_not_in_group to idist.utils.
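For context, a minimal sketch of what such a helper could look like for the native torch.distributed backend is below; the exact signature and how it dispatches through the computation models are assumptions for illustration, not necessarily the merged code:

from typing import Any, List, Optional, Union

import torch.distributed as dist


def _rank_not_in_group(group: Optional[Union[Any, List[int]]]) -> bool:
    # No group given: every rank participates, so the current rank is never excluded.
    if group is None:
        return False
    # A group handle created by dist.new_group(): get_rank(group) returns -1
    # for ranks that are not members of the group.
    if isinstance(group, dist.ProcessGroup):
        return dist.get_rank(group) == -1
    # A plain list of ranks: check membership of the global rank.
    return dist.get_rank() not in group

A collective helper can then return early (no-op) on ranks that are outside the group.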

@github-actions github-actions bot added the module: distributed Distributed module label Feb 28, 2025
@vfdev-5
Copy link
Collaborator

vfdev-5 commented Feb 28, 2025

Thanks for the PR @sadra-barikbin !

@@ -16,6 +16,9 @@
)
from ignite.utils import setup_logger

if has_hvd_support:
Collaborator


Let's remove this as it was for typing only, right?

@sadra-barikbin
Collaborator Author

@vfdev-5, is XLA's _do_new_group correct?

def _do_new_group(self, ranks: List[int], **kwargs: Any) -> Any:
    return [ranks]  # Shouldn't it return `list(ranks)`?

@vfdev-5
Collaborator

vfdev-5 commented Mar 1, 2025

@vfdev-5, is XLA's _do_new_group correct?

def _do_new_group(self, ranks: List[int], **kwargs: Any) -> Any:
    return [ranks]  # Shouldn't it return `list(ranks)`?

On XLA, a group is a list of lists of ranks, for example: https://pytorch.org/xla/release/r2.6/learn/api-guide.html#torch_xla.core.xla_model.all_gather
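To illustrate (a hedged sketch; the two-rank group and the tensor are made up for the example), the nested-list format maps directly onto torch_xla's groups argument:

import torch
import torch_xla.core.xla_model as xm

device = xm.xla_device()
t = torch.tensor([xm.get_ordinal()], device=device)

# torch_xla expects `groups` as a list of replica groups, i.e. a list of
# lists of ranks. A single group made of ranks 0 and 1 is therefore
# [[0, 1]], which is why _do_new_group wraps the incoming ranks: [ranks].
# pin_layout=False avoids layout-pinning restrictions when custom replica
# groups are used.
res = xm.all_gather(t, groups=[[0, 1]], pin_layout=False)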

@vfdev-5
Collaborator

vfdev-5 commented Mar 2, 2025

Failing test:

  >       gloo_hvd_executor(_test_distrib_all_gather_group, (device,), np=np, do_init=True)

   [0]<stderr>:  File "/home/runner/work/ignite/ignite/tests/ignite/distributed/utils/__init__.py", line 228, in _test_distrib_all_gather_group
  [0]<stderr>:    res = idist.all_gather(t, group=group)
  [0]<stderr>:          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  [0]<stderr>:  File "/home/runner/work/ignite/ignite/ignite/distributed/utils.py", line 435, in all_gather
  [0]<stderr>:    return _model.all_gather(tensor, group=group)
  [0]<stderr>:           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  [0]<stderr>:  File "/home/runner/work/ignite/ignite/ignite/distributed/comp_models/base.py", line 224, in all_gather
  [0]<stderr>:    return self._collective_op(tensor, self._do_all_gather, group=group)
  [1]<stderr>:User function raise error: horovod_torch_allgather_async_torch_LongTensor(): incompatible function arguments. The following argument types are supported:
  [1]<stderr>:    1. (arg0: torch.Tensor, arg1: torch.Tensor, arg2: str, arg3: int) -> int

@vfdev-5
Collaborator

vfdev-5 commented Mar 3, 2025

We have this method coded incorrectly:

def _do_new_group(self, ranks: List[int], **kwargs: Any) -> Any:
    return hvd.ProcessSet(ranks)

It should be:

def _do_new_group(self, ranks: List[int], **kwargs: Any) -> Any:
    return hvd.add_process_set(ranks)

We also need to set the HOROVOD_DYNAMIC_PROCESS_SETS=1 env var.
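A hedged sketch of what this implies end to end (the ranks [0, 1] are illustrative only):

import os

# Horovod reads this during initialization, so it must be set before
# hvd.init() (or exported in the environment before launching the workers).
os.environ["HOROVOD_DYNAMIC_PROCESS_SETS"] = "1"

import horovod.torch as hvd

hvd.init()

# hvd.ProcessSet(ranks) only builds the Python-side object; it is never
# registered with the Horovod backend, so collectives on it fail.
# add_process_set() registers the set dynamically and returns a usable handle.
group = hvd.add_process_set([0, 1])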

I'll commit some updates to this PR.

@vfdev-5 vfdev-5 force-pushed the Feature-add-rank-not-in-group branch 3 times, most recently from 6e6c0cc to 9971932 on March 5, 2025 12:32
@vfdev-5 vfdev-5 force-pushed the Feature-add-rank-not-in-group branch from 9971932 to 947ef84 on March 8, 2025 14:18
Collaborator

@vfdev-5 vfdev-5 left a comment


The failures seem not to be related to the PR's feature.
Let's fix them later.

@vfdev-5 vfdev-5 merged commit 001e81e into pytorch:master Mar 12, 2025
15 of 20 checks passed