
Add _rank_not_in_group to idist #3339


Merged
merged 18 commits into pytorch:master from Feature-add-rank-not-in-group on Mar 12, 2025

Conversation

sadra-barikbin
Collaborator

Hi there!

This PR adds _rank_not_in_group to idist.utils.
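For context, a minimal sketch of what such a helper could look like for the native torch.distributed backend is below; the exact signature and how it dispatches through the computation models are assumptions for illustration, not necessarily the merged code:

from typing import Any, List, Optional, Union

import torch.distributed as dist


def _rank_not_in_group(group: Optional[Union[Any, List[int]]]) -> bool:
    # No group given: every rank participates, so the current rank is never excluded.
    if group is None:
        return False
    # A group handle created by dist.new_group(): get_rank(group) returns -1
    # for ranks that are not members of the group.
    if isinstance(group, dist.ProcessGroup):
        return dist.get_rank(group) == -1
    # A plain list of ranks: check membership of the global rank.
    return dist.get_rank() not in group

A collective helper can then return early (no-op) on ranks that are outside the group.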

@github-actions github-actions bot added the module: distributed Distributed module label Feb 28, 2025
@vfdev-5
Copy link
Collaborator

vfdev-5 commented Feb 28, 2025

Thanks for the PR @sadra-barikbin !

@@ -16,6 +16,9 @@
)
from ignite.utils import setup_logger

if has_hvd_support:
Collaborator


Let's remove this as it was for typing only, right?

@sadra-barikbin
Collaborator Author

@vfdev-5, is XLA's _do_new_group correct?

def _do_new_group(self, ranks: List[int], **kwargs: Any) -> Any:
    return [ranks]  # Shouldn't it return `list(ranks)`?

@vfdev-5
Collaborator

vfdev-5 commented Mar 1, 2025

@vfdev-5, is XLA's _do_new_group correct?

def _do_new_group(self, ranks: List[int], **kwargs: Any) -> Any:
    return [ranks]  # Shouldn't it return `list(ranks)`?

On XLA, a group is a list of lists of ranks, for example: https://pytorch.org/xla/release/r2.6/learn/api-guide.html#torch_xla.core.xla_model.all_gather
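To illustrate (a hedged sketch; the two-rank group and the tensor are made up for the example), the nested-list format maps directly onto torch_xla's groups argument:

import torch
import torch_xla.core.xla_model as xm

device = xm.xla_device()
t = torch.tensor([xm.get_ordinal()], device=device)

# torch_xla expects `groups` as a list of replica groups, i.e. a list of
# lists of ranks. A single group made of ranks 0 and 1 is therefore
# [[0, 1]], which is why _do_new_group wraps the incoming ranks: [ranks].
# pin_layout=False avoids layout-pinning restrictions when custom replica
# groups are used.
res = xm.all_gather(t, groups=[[0, 1]], pin_layout=False)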

@vfdev-5
Collaborator

vfdev-5 commented Mar 2, 2025

Failing test:

  >       gloo_hvd_executor(_test_distrib_all_gather_group, (device,), np=np, do_init=True)

   [0]<stderr>:  File "/home/runner/work/ignite/ignite/tests/ignite/distributed/utils/__init__.py", line 228, in _test_distrib_all_gather_group
  [0]<stderr>:    res = idist.all_gather(t, group=group)
  [0]<stderr>:          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  [0]<stderr>:  File "/home/runner/work/ignite/ignite/ignite/distributed/utils.py", line 435, in all_gather
  [0]<stderr>:    return _model.all_gather(tensor, group=group)
  [0]<stderr>:           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  [0]<stderr>:  File "/home/runner/work/ignite/ignite/ignite/distributed/comp_models/base.py", line 224, in all_gather
  [0]<stderr>:    return self._collective_op(tensor, self._do_all_gather, group=group)
  [1]<stderr>:User function raise error: horovod_torch_allgather_async_torch_LongTensor(): incompatible function arguments. The following argument types are supported:
  [1]<stderr>:    1. (arg0: torch.Tensor, arg1: torch.Tensor, arg2: str, arg3: int) -> int

@vfdev-5
Collaborator

vfdev-5 commented Mar 3, 2025

We have this method coded incorrectly:

def _do_new_group(self, ranks: List[int], **kwargs: Any) -> Any:
    return hvd.ProcessSet(ranks)

It should be:

def _do_new_group(self, ranks: List[int], **kwargs: Any) -> Any:
    return hvd.add_process_set(ranks)

We also need to set the HOROVOD_DYNAMIC_PROCESS_SETS=1 env var.
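A hedged sketch of what this implies end to end (the ranks [0, 1] are illustrative only):

import os

# Horovod reads this during initialization, so it must be set before
# hvd.init() (or exported in the environment before launching the workers).
os.environ["HOROVOD_DYNAMIC_PROCESS_SETS"] = "1"

import horovod.torch as hvd

hvd.init()

# hvd.ProcessSet(ranks) only builds the Python-side object; it is never
# registered with the Horovod backend, so collectives on it fail.
# add_process_set() registers the set dynamically and returns a usable handle.
group = hvd.add_process_set([0, 1])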

I'll commit some updates to this PR.

@vfdev-5 vfdev-5 force-pushed the Feature-add-rank-not-in-group branch 3 times, most recently from 6e6c0cc to 9971932 on March 5, 2025 12:32
@vfdev-5 vfdev-5 force-pushed the Feature-add-rank-not-in-group branch from 9971932 to 947ef84 on March 8, 2025 14:18
Collaborator

@vfdev-5 vfdev-5 left a comment


The failures seem not to be related to the PR's feature.
Let's fix them later.

@vfdev-5 vfdev-5 merged commit 001e81e into pytorch:master Mar 12, 2025
15 of 20 checks passed