
Very well written! Hopefully, these insights are allowed past the marketing department, managers, and other bureaucratic mechanisms that "protect" the developers. Hopefully, the driver team has enough bandwidth to address this, rather than only working on new features, new hardware, or fixing other bugs.

Some of the issues you can see in the graphs indicate architectural issues, such as batching strategy and DMA chunk sizes. Other things could be hardware choices, such as implementing less of the DMA mastering on the GPU side (i.e., using Windows paging to transfer data to the GPU instead of sending a physical address to the GPU DMA engine).

We can always speculate on the thoughts behind the implementation, but the results clearly show room for improvement. In the meantime, great job on creating something that is close to the ideal bug report: specific, data-driven, repeatable, and fully described.


That ISR/DPC graph is crazy. Spending 1/3 of a CPU-second in ISRs for every 1 second of wall clock time running a single-threaded DX11 application is insane. Even if they moved that into a DPC, that's still going to be disruptive to other latency-sensitive applications on the system. I'd be curious to know how evenly distributed they are across CPUs and whether it's a lot of short ISRs or a few long ones that add up.
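For anyone who wants to check that distribution themselves: the per-CPU ISR/DPC numbers come from kernel ETW events. Below is a minimal sketch (mine, not from the article) that starts a kernel trace session with ISR and DPC events enabled; the resulting .etl can be opened in WPA, which breaks ISR/DPC time down by CPU and by driver module, or summarized with `xperf -i isr_dpc.etl -a dpcisr`.

```cpp
// Sketch: start the NT kernel logger with ISR + DPC events enabled.
// Assumes the Windows SDK and admin rights; link with advapi32.lib.
// The output path is a placeholder -- the directory must exist.
#define INITGUID  // pulls in SystemTraceControlGuid from evntrace.h
#include <windows.h>
#include <evntrace.h>
#include <cstdio>
#include <vector>

int main() {
    // EVENT_TRACE_PROPERTIES must be followed by room for the logger
    // name and the log file name.
    const ULONG bufSize = sizeof(EVENT_TRACE_PROPERTIES) + 2 * 1024;
    std::vector<unsigned char> buf(bufSize, 0);
    auto* props = reinterpret_cast<EVENT_TRACE_PROPERTIES*>(buf.data());

    props->Wnode.BufferSize = bufSize;
    props->Wnode.Guid = SystemTraceControlGuid;      // the kernel logger
    props->Wnode.Flags = WNODE_FLAG_TRACED_GUID;
    props->Wnode.ClientContext = 1;                  // QPC timestamps
    props->EnableFlags = EVENT_TRACE_FLAG_INTERRUPT  // ISR events
                       | EVENT_TRACE_FLAG_DPC;       // DPC events
    props->LogFileMode = EVENT_TRACE_FILE_MODE_SEQUENTIAL;
    props->LoggerNameOffset = sizeof(EVENT_TRACE_PROPERTIES);
    props->LogFileNameOffset = sizeof(EVENT_TRACE_PROPERTIES) + 1024;
    wcscpy_s(reinterpret_cast<wchar_t*>(buf.data() + props->LogFileNameOffset),
             512, L"C:\\traces\\isr_dpc.etl");

    TRACEHANDLE session = 0;
    ULONG status = StartTraceW(&session, KERNEL_LOGGER_NAMEW, props);
    if (status != ERROR_SUCCESS) {
        std::printf("StartTrace failed: %lu (admin? logger already running?)\n", status);
        return 1;
    }
    std::printf("Kernel ISR/DPC trace running; stop it with: xperf -stop\n");
    return 0;
}
```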

It's unlikely that it's contributing significantly to the poor 3DMark API Overhead test results at these kinds of frame rates, but it's certainly interesting that Intel can't seem to do Independent Flip on Vulkan swapchains. Having to context switch to DWM to present each frame isn't free and is almost certainly costing them significant performance in very high frame rate Vulkan applications. Nvidia's driver can also directly present Vulkan swapchains, like you saw on the AMD test.


Wonderful write-up. Much more informative than your YouTube brethren's attempts.


Apparently, the new AMD drivers got worse for DirectX 11 in your tests. On the RX 6900 XT forums, scores of more than 4 million in the 3DMark API Overhead test were posted 1-1.5 years ago, on Intel 12600K and AMD 7600X systems.


Now I am also curious if and how the performance of the new RDNA4 GPUs will be affected by the generational age of the CPU (and BIOS) they are paired with. If (when) you test one of those, please also test it in an older system.


Some of this might simply be due to Intel putting most of their efforts where they will bring the greatest benefits to likely buyers of their B580. And even without having an actual marketing study in front of me, I'd strongly guess that the most likely buyer of a B580 is someone with a more recent CPU and BIOS than the ones affected by the issue described here. Intel's graphics division's #1 goal right now has to be keeping up the momentum from the generally positive reception of the first Battlemage dGPU. That means focusing on ironing out any remaining driver issues with the top 30-50 games and some key apps. The honeymoon period will be over very soon, as AMD is rolling out RDNA4 and Nvidia is about to launch their 5060.


That's a likely explanation. Equally likely is that reviewers tend to test GPUs with rather powerful CPUs, so that's what Intel targeted. Buyers with older CPUs may be less common, but I wouldn't be surprised if there's a significant minority of people out there with a Ryzen 5 2600 or i5-9600. I think pushing people to upgrade by scaling badly to older hardware isn't great, and in this case, it probably wasn't intentional either.


I agree that it's certainly not a good thing to limit the likely benefit of any new hardware to buyers who already have more modern (capable) gear. I also agree that it's unlikely that Intel chose to willfully ignore users with older CPUs and BIOSes when they developed Battlemage. In addition to the points you and others already mentioned, there is another possible explanation (pure speculation on my part!): the first Battlemage GPU was (is) the iGPU in Lunar Lake, which was paired with a new CPU generation and probably optimized for that setting. As the basic architecture of the B580 is very much the same as that of the Xe2 iGPU in Lunar Lake (of course a lot bigger and with dedicated VRAM), making it work well with CPUs without rBAR simply wasn't on their radar.

In an ironic twist, it's probably easier for owners of an older AMD PC to benefit from a B580 than for those with an older Intel system.


What does "don't excessively rely on PCIe Resizeable BAR" mean?

Resizable BAR is like 10 years overdue. The idea that you need to go through a driver path to get access to the whole of shared memory is absurd. Why is it excessive to expect the BIOS to set the BAR to cover the whole shared region and leave that small-BAR nonsense in the past?

What BAR size did you use for your testing? Did a large BAR change the CPU dependency?
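For reference, checking what the firmware actually assigned is straightforward: on Windows through the memory ranges in Device Manager, or on Linux straight from sysfs. A rough sketch of the latter (the PCI address is a placeholder; find yours with lspci):

```cpp
// Sketch: print the BAR sizes of a PCI device by parsing its sysfs
// "resource" file (lines of "start end flags" in hex). Assumes Linux;
// the device address below is a placeholder.
#include <cstdint>
#include <cstdio>
#include <fstream>
#include <sstream>
#include <string>

int main() {
    // Placeholder PCI address; substitute your GPU's ("lspci | grep VGA").
    const std::string path = "/sys/bus/pci/devices/0000:03:00.0/resource";
    std::ifstream f(path);
    if (!f) { std::perror("open"); return 1; }

    std::string line;
    for (int bar = 0; std::getline(f, line); ++bar) {
        uint64_t start = 0, end = 0, flags = 0;
        std::istringstream(line) >> std::hex >> start >> end >> flags;
        if (end <= start) continue;  // unused BAR
        // With resizable BAR enabled, one region should be large enough
        // to cover the card's whole VRAM (a multi-GiB range).
        std::printf("BAR %d: %.1f MiB\n", bar,
                    (end - start + 1) / (1024.0 * 1024.0));
    }
    return 0;
}
```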


I think it's more complicated than that. Sure, Resizable BAR has been in the spec for a very long time, but it wasn't a big deal in the consumer space until AMD pushed it with RDNA 2 in 2021. Even then, you didn't lose a lot of performance if your platform couldn't support Resizable BAR. The same applies to Nvidia's RTX 3000 series. Resizable BAR on consumer platforms is really a recent thing, only relevant (but not particularly important) to the last couple of GPU generations.

Also, you can have the GPU copy data from host memory to VRAM, rather than directly having the CPU access VRAM through the BAR range.
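To make that concrete in Vulkan terms: the BAR path means mapping HOST_VISIBLE | DEVICE_LOCAL memory and having the CPU write into VRAM directly, while the copy path stages the data in host memory and lets the GPU's DMA engine pull it across with vkCmdCopyBuffer. A minimal sketch of the copy path, assuming the device, buffers, and command buffer already exist (all handles here are hypothetical):

```cpp
// Sketch: upload via staging buffer + GPU copy instead of writing VRAM
// directly through the BAR window. Assumes a created VkDevice, a
// HOST_VISIBLE | HOST_COHERENT staging buffer, a DEVICE_LOCAL destination
// buffer, and an allocated command buffer. Error checking omitted.
#include <vulkan/vulkan.h>
#include <cstring>

void uploadViaStagingCopy(VkDevice device,
                          VkDeviceMemory stagingMemory,
                          VkBuffer stagingBuffer,
                          VkBuffer deviceLocalBuffer,  // lives in VRAM
                          VkCommandBuffer cmd,
                          const void* data, VkDeviceSize size) {
    // 1. The CPU writes into host memory only -- no BAR access needed.
    void* mapped = nullptr;
    vkMapMemory(device, stagingMemory, 0, size, 0, &mapped);
    std::memcpy(mapped, data, size);
    vkUnmapMemory(device, stagingMemory);

    // 2. The GPU's copy/DMA engine moves the data into VRAM.
    VkCommandBufferBeginInfo begin{VK_STRUCTURE_TYPE_COMMAND_BUFFER_BEGIN_INFO};
    begin.flags = VK_COMMAND_BUFFER_USAGE_ONE_TIME_SUBMIT_BIT;
    vkBeginCommandBuffer(cmd, &begin);

    VkBufferCopy region{0 /*srcOffset*/, 0 /*dstOffset*/, size};
    vkCmdCopyBuffer(cmd, stagingBuffer, deviceLocalBuffer, 1, &region);

    vkEndCommandBuffer(cmd);
    // Submit 'cmd' to a queue (ideally a dedicated transfer queue) and
    // wait on a fence before using deviceLocalBuffer.
}
```

The copy path works at full speed even with a small BAR, which is part of why a missing Resizable BAR doesn't have to be catastrophic.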

I had Resizable BAR working for testing; I checked the memory ranges in Device Manager (the allocated address space is large enough to cover VRAM for both cards).


Well, thanks for this. I've been trying to figure this out for the past two days.

I found that Intel defaults to 3 pre-rendered frames, versus AMD's 0 pre-rendered frames.

There is a registry setting called "HwQueuedRenderPacketGroupLimitPerNode" that is set to 3.

In DX terms, it would be equivalent to the flip buffer size or flip queue.

Changing this setting does nothing at the moment. I wonder if disabling Intel's hardware flip and enabling legacy software flip set to 1-2 would reduce the overhead.
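If anyone wants to poke at the value programmatically rather than through regedit, something like this reads it back. Note the class-key path is my guess at the display adapter class key, and the 0000 subkey index varies per machine, so verify the path in regedit first:

```cpp
// Sketch: read the DWORD "HwQueuedRenderPacketGroupLimitPerNode" from the
// display adapter's driver class key. The "\0000" subkey index is a guess;
// it differs per machine (0001, 0002, ...). Assumes Windows; link with
// advapi32.lib.
#include <windows.h>
#include <cstdio>

int main() {
    const wchar_t* keyPath =
        L"SYSTEM\\CurrentControlSet\\Control\\Class\\"
        L"{4d36e968-e325-11ce-bfc1-08002be10318}\\0000";  // display adapters
    DWORD value = 0, size = sizeof(value);
    LSTATUS status = RegGetValueW(HKEY_LOCAL_MACHINE, keyPath,
                                  L"HwQueuedRenderPacketGroupLimitPerNode",
                                  RRF_RT_REG_DWORD, nullptr, &value, &size);
    if (status == ERROR_SUCCESS)
        std::printf("HwQueuedRenderPacketGroupLimitPerNode = %lu\n", value);
    else
        std::printf("Not found under this subkey (error %ld); try 0001, 0002, ...\n",
                    status);
    return 0;
}
```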


It is sort of surprising that it works so badly in Vulkan. D3D11 and earlier don't really matter as much, since there are alternatives like DXVK.


DXVK won't help in a case where Vulkan itself works badly, as you mention.

AFAIR, Intel was wrapping DX versions below 12 onto DX12 for Arc, so that may also be one source of overhead for these kinds of calls.
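For context on what such wrapping looks like, Microsoft's public D3D11On12 layer follows the same general shape (whether Intel's driver-internal mapping layer resembles it is my speculation, not something the article confirms): a real D3D12 device is created first, and D3D11 is layered on top of it, so every D3D11 call gets translated into D3D12 work underneath — extra per-call overhead by construction.

```cpp
// Sketch of DX11-on-DX12 layering using Microsoft's public D3D11On12 API.
// Assumes the Windows SDK; link with d3d12.lib and d3d11.lib. Error
// checking omitted for brevity.
#include <d3d12.h>
#include <d3d11on12.h>
#include <wrl/client.h>
using Microsoft::WRL::ComPtr;

int main() {
    // The "real" device is D3D12, on the default adapter.
    ComPtr<ID3D12Device> device12;
    D3D12CreateDevice(nullptr, D3D_FEATURE_LEVEL_11_0, IID_PPV_ARGS(&device12));

    D3D12_COMMAND_QUEUE_DESC queueDesc{};  // direct queue, default settings
    ComPtr<ID3D12CommandQueue> queue;
    device12->CreateCommandQueue(&queueDesc, IID_PPV_ARGS(&queue));

    // D3D11 is layered on top; its calls are recorded as D3D12 work on
    // the queue below.
    ComPtr<ID3D11Device> device11;
    ComPtr<ID3D11DeviceContext> context11;
    IUnknown* queues[] = { queue.Get() };
    D3D11On12CreateDevice(device12.Get(), 0 /*flags*/,
                          nullptr, 0,       // default feature levels
                          queues, 1, 0 /*node*/,
                          &device11, &context11, nullptr);
    return 0;
}
```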
