The documentation at https://pytorch.org/docs/stable/distributed.html states that BAND, BOR, and BXOR are supported reduction operators; however, they do not work with all_reduce when using the NCCL backend.
Looking at the code, the map used to translate a ReduceOp into the corresponding NCCL operation has no entries for the bitwise operators. When a key is missing, the map's subscript operator default-constructs a ncclRedOp_t value (https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/api/types.html#ncclredop-t), which is the enum's zero value, ncclSum. As a result, these bitwise reduction ops are silently translated to ncclSum, so using them just performs a sum.