Appendix
Register File Capacity
Vector register file capacity was determined by looking at Freedreno source code.
- https://gitlab.freedesktop.org/mesa/mesa/-/blob/main/src/freedreno/ir3/ir3_compiler.h?ref_type=heads says register file capacity is computed with “reg_size_vec4 * threadsize_base * wave_granularity * 16 (bytes per vec4)”
- For Adreno 530, reg_size_vec4 (48) is set for the entire 5xx generation at https://gitlab.freedesktop.org/mesa/mesa/-/blob/main/src/freedreno/ir3/ir3_compiler.c?ref_type=heads#L240, suggesting there is no variation across SKUs within the 5xx generation. threadsize_base (32) is given just below. wave_granularity is 2, from https://gitlab.freedesktop.org/mesa/mesa/-/blob/main/src/freedreno/common/freedreno_devices.py?ref_type=heads#L281. Multiplying: 48 * 32 * 2 * 16 = 49152 bytes, or 48 KB
- Adreno 7xx has reg_size_vec4 (64) defined in https://gitlab.freedesktop.org/mesa/mesa/-/blob/main/src/freedreno/common/freedreno_devices.py?ref_type=heads#L764, with wave_granularity (2) defined just below. threadsize_base is set to 64 for 6xx and newer Adreno GPUs in https://gitlab.freedesktop.org/mesa/mesa/-/blob/main/src/freedreno/ir3/ir3_compiler.c?ref_type=heads#L247. Multiplying: 64 * 64 * 2 * 16 = 131072 bytes, or 128 KB. Divide that by 2 because Freedreno's code considers two SPs to be one, giving 64 KB per SP
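The arithmetic above can be reproduced with a short script. The function name is mine, but the parameter names follow the Freedreno identifiers quoted above:

```python
# Register file capacity, per the Freedreno formula:
# reg_size_vec4 * threadsize_base * wave_granularity * 16 (bytes per vec4)
def reg_file_bytes(reg_size_vec4, threadsize_base, wave_granularity):
    return reg_size_vec4 * threadsize_base * wave_granularity * 16

a530 = reg_file_bytes(48, 32, 2)   # Adreno 5xx values
a7xx = reg_file_bytes(64, 64, 2)   # Adreno 7xx values
a7xx_per_sp = a7xx // 2            # Freedreno treats two SPs as one

print(a530 // 1024)         # 48 (KB)
print(a7xx // 1024)         # 128 (KB)
print(a7xx_per_sp // 1024)  # 64 (KB per SP)
```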
Instruction Cache Capacity
Freedreno also lists instruction cache capacity. It's set to 127 units of 128 bytes each in freedreno_devices.py, but there's a comment just above:
# Blob limits it to 128 but we hang with 128
freedreno_devices.py, a7xx section: https://gitlab.freedesktop.org/mesa/mesa/-/blob/main/src/freedreno/common/freedreno_devices.py?ref_type=heads#L368
Adreno apparently supports pre-loading a kernel’s instructions to reduce instruction cache warmup time. Normally when a kernel starts, its instructions won’t yet be in the instruction cache. Instruction fetches will miss in the cache, which generates refill requests. After the hot section of the kernel gets filled into the cache, execution won’t be bound by instruction fetch latency anymore. Pre-loading makes that happen faster. Evidently trying to preload up to instruction cache capacity causes a hang on Adreno 7xx, even though the developers saw Qualcomm’s blobs using the full instruction cache size.
In any case, 128 * 128B = 16 KB. The same file indicates early 6xx Adreno GPUs used 64 * 128B = 8 KB instruction caches, so Adreno 530 might use that size as well. The instruction cache may be shared across multiple SPs, but that can't be determined from the code alone.
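The cache size arithmetic, as a quick sanity check (function name is mine; the 128-byte unit size comes from freedreno_devices.py):

```python
# Instruction cache capacity from a count of 128-byte units
def icache_bytes(units, unit_size=128):
    return units * unit_size

print(icache_bytes(128) // 1024)  # 16 (KB), nominal A7xx capacity
print(icache_bytes(64) // 1024)   # 8 (KB), early 6xx and possibly A530
```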
Wave Assignment
Unlike A5xx, where waves are assigned to SP partitions in round-robin order, A7xx (and A6xx) GPUs launch waves and assign them to SP partitions in pairs. Odd and even waves thus end up on the same scheduler partition and share execution resources.
Because of this behavior, Mesa code pretends Adreno 730 operates in wave128 mode. However, that isn't actually the case: divergence penalties disappear once branch behavior is coherent across 64 threads, which wouldn't happen if waves were truly 128 threads wide.
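The two assignment schemes can be sketched as a toy model. This is purely an illustration of the behavior described above, not code from Freedreno, and the partition count is a made-up example:

```python
# A5xx style: consecutive waves go to consecutive partitions
def assign_round_robin(n_waves, n_partitions):
    return [w % n_partitions for w in range(n_waves)]

# A6xx/A7xx style: waves launch in pairs, so the odd and even wave
# of each pair land on the same scheduler partition
def assign_paired(n_waves, n_partitions):
    return [(w // 2) % n_partitions for w in range(n_waves)]

print(assign_round_robin(8, 4))  # [0, 1, 2, 3, 0, 1, 2, 3]
print(assign_paired(8, 4))       # [0, 0, 1, 1, 2, 2, 3, 3]
```

In the paired scheme, two wave64s sharing a partition can look like a single wave128 from a resource-accounting point of view, which is presumably why Mesa models it that way.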