Discussion about this post

Olstyle

ML workloads will be really interesting. A decently sized GPU with access to 128 GB of RAM could be faster than any other consumer device in cases where those devices just can't fit the model into memory. Intel's B60 has already shown that memory alone can make a difference.

jozsef

Thank you for the article!

Could you double-check the RTX 5070 Mobile cache bandwidth? I think the data is not correct.

In theory, a Blackwell SM can reach 128 bytes per clock cycle from L1. Of course, that is the upper limit.

But in the RTX PRO 6000 Blackwell article, one SM reaches ca. 100 bytes per clock cycle.

If we do the math with this number, and assume a minimum operating frequency of 1.5 GHz and the 5070M's SM count of 36, then the L1 bandwidth should be 36*100*1.5 GB/s, which is 5400 GB/s. If we assume a 2 GHz operating frequency, we reach 7200 GB/s, and at 2.5 GHz, 9000 GB/s. So why is there such a big difference between your measurement (ca. 3000 GB/s) and the math? Thank you very much for the answer, and sorry about my English!
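The estimate above can be reproduced with a short Python sketch. The function names are hypothetical and chosen for illustration only. The inputs are the figures quoted in this comment (36 SMs, ca. 100 bytes per clock per SM, and an assumed clock range), not measured values, and the real clock speed during the test is unknown.

```python
def aggregate_l1_bandwidth_gb_s(sm_count, bytes_per_clock_per_sm, clock_ghz):
    """Aggregate L1 bandwidth in GB/s: SMs * bytes per clock * clock in GHz."""
    return sm_count * bytes_per_clock_per_sm * clock_ghz


def implied_bytes_per_clock(measured_gb_s, sm_count, clock_ghz):
    """Per-SM bytes per clock that a measured aggregate bandwidth would imply."""
    return measured_gb_s / (sm_count * clock_ghz)


SM_COUNT = 36          # RTX 5070 Mobile SM count, as quoted in the comment
BYTES_PER_CLOCK = 100  # ca. per-SM figure quoted from the RTX PRO 6000 article
MEASURED_GB_S = 3000   # approximate measured figure quoted in the comment

for clock_ghz in (1.5, 2.0, 2.5):  # assumed clocks, not measured ones
    expected = aggregate_l1_bandwidth_gb_s(SM_COUNT, BYTES_PER_CLOCK, clock_ghz)
    implied = implied_bytes_per_clock(MEASURED_GB_S, SM_COUNT, clock_ghz)
    print(f"{clock_ghz} GHz: expected {expected:.0f} GB/s, "
          f"measurement implies {implied:.1f} bytes/clock per SM")
```

The second function only runs the same arithmetic in reverse. It shows what per-SM throughput the ca. 3000 GB/s figure would correspond to at each assumed clock, and it does not explain where the gap comes from.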
