Editor’s Note (6/14/2023): We have a new article that reevaluates the cache latency of Navi 31, so please refer to that article for some new latency data.
What's up with the dismally poor dual issue rate? Back in the VLIW5 days, AMD had an average packing rate of 3.4. With RDNA3, I get the impression using that second ALU is like shooting for the moon. The four read ports would restrict dual FMA operations, but I assume it's more than that.
Keep in mind dual issue is only required for wave32 mode. In wave64 mode it'll naturally use the second ALU. Compiler optimization is an issue for wave32 mode, but looked at another way, dual issue can situationally reduce instruction dispatch pressure in very compute-bound kernels, and it probably has low hardware cost.
Wave64 removes any dependency issues, but I think the four read ports would still be an issue for FMA operations? What I really don't get is why not use wave64 for everything? It's less efficient, but up to twice the work gets done compared to wave32 (if that second ALU is rarely getting used in wave32).
Wave32 is useful for reducing divergence costs and latency within a single thread, so it's great for smaller, less compute-bound dispatches like vertex shaders. For more parallel workloads like pixel shaders, AMD likes to use wave64.
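To put rough numbers on the trade-off discussed above, here is a toy Python sketch of ALU utilization as a function of how often the second ALU gets used. It is only a back-of-the-envelope model: the function names are made up for illustration, the 32-lane width per ALU is an assumption, and it ignores register read ports, dependencies, and memory stalls entirely. The only figure taken from the thread is the VLIW5 packing rate of 3.4 out of 5 slots.

```python
def lanes_per_cycle(dual_issue_fraction: float, lanes: int = 32) -> float:
    """Average FP32 lanes doing work per issue cycle.

    Toy model (assumption): one SIMD has two ALUs, each `lanes` wide.
    The first ALU is always busy; the second is busy in
    `dual_issue_fraction` of issue cycles.
    """
    if not 0.0 <= dual_issue_fraction <= 1.0:
        raise ValueError("dual_issue_fraction must be between 0 and 1")
    return lanes * (1.0 + dual_issue_fraction)


def utilization(dual_issue_fraction: float, lanes: int = 32) -> float:
    """Fraction of the SIMD's peak FP32 lanes that are busy."""
    return lanes_per_cycle(dual_issue_fraction, lanes) / (2 * lanes)


# VLIW5 comparison point from the thread: 3.4 of 5 slots filled on average.
VLIW5_UTILIZATION = 3.4 / 5

if __name__ == "__main__":
    for fraction in (0.0, 0.25, 0.5, 1.0):
        print(f"second ALU used {fraction:.0%} of the time: "
              f"{lanes_per_cycle(fraction):.0f} lanes/cycle, "
              f"{utilization(fraction):.0%} of peak")
    print(f"VLIW5 at 3.4/5 packing: {VLIW5_UTILIZATION:.0%} of peak")
```

Under this model, a wave32 kernel that never dual issues sits at half of peak, and the second ALU would need to be used in about 36% of issue cycles to match the utilization implied by the VLIW5 packing rate. A wave64 instruction that can use both ALUs corresponds to the upper end of the range, subject to the read port limits raised above.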