Nvidia has a long tradition of building giant GPUs.
Thank you for your insightful post, as always!
As indicated in the PTX documentation (https://docs.nvidia.com/cuda/parallel-thread-execution/#integer-arithmetic-instructions-mad ), IMAD and IMAD.WIDE means 32-bit × 32-bit → 32-bit/64-bit integer multiplication, resp.
According to Table 7 in https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#arithmetic-instructions , widening multiplication delivers only half the throughput of its non-widening counterpart, so the compiler appears to favor non-widening one for 32-bit address generation for shared space.
Funny how the XTX has an FP64 equivalent to this massive and expensive GPU.
Thank you for your insightful post, as always!
As indicated in the PTX documentation (https://docs.nvidia.com/cuda/parallel-thread-execution/#integer-arithmetic-instructions-mad ), IMAD and IMAD.WIDE means 32-bit × 32-bit → 32-bit/64-bit integer multiplication, resp.
According to Table 7 in https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#arithmetic-instructions , widening multiplication delivers only half the throughput of its non-widening counterpart, so the compiler appears to favor non-widening one for 32-bit address generation for shared space.
Funny how the XTX has an FP64 equivalent to this massive and expensive GPU.