DeepSeek's AI breakthrough bypasses industry-standard CUDA, uses assembly-like PTX programming instead

vegeta@lemmy.world · 5 months ago

KingRandomGuy@lemmy.world · edit-2 5 months ago

What I’m curious to see is how well these types of modifications scale with compute. DeepSeek is restricted to H800s instead of H100s or H200s. These are gimped cards built to get around export controls, and accordingly they have lower memory bandwidth (~2 vs ~3 TB/s) and, most notably, much slower GPU-to-GPU communication (something like 400 GB/s vs 900 GB/s). The specific reason they used PTX in this application was to help alleviate some of the bottlenecks caused by the limited inter-GPU bandwidth, so I wonder whether it would still improve performance on H100 and H200 GPUs, where that bandwidth is much higher.
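
To put rough numbers on that question, here is a back-of-the-envelope sketch in Python. It only uses the approximate 400 GB/s and 900 GB/s link figures quoted above; the payload size and compute time are made-up illustration values, the function names are hypothetical, and it ignores latency and any overlap of communication with compute, so treat it as a toy model and not a benchmark.

```python
# Approximate GPU-to-GPU link bandwidths from the comment above, in GB/s.
# These are rough quoted figures, not measurements.
LINK_BANDWIDTH_GBPS = {"H800": 400.0, "H100": 900.0}


def transfer_time_ms(payload_gb: float, bandwidth_gbps: float) -> float:
    """Ideal time in milliseconds to move payload_gb over one link (no latency)."""
    return payload_gb / bandwidth_gbps * 1000.0


def comm_fraction(compute_ms: float, payload_gb: float, bandwidth_gbps: float) -> float:
    """Share of a step spent communicating if compute and transfer never overlap."""
    comm_ms = transfer_time_ms(payload_gb, bandwidth_gbps)
    return comm_ms / (comm_ms + compute_ms)


if __name__ == "__main__":
    payload_gb = 1.0   # hypothetical data exchanged per step
    compute_ms = 10.0  # hypothetical compute time per step
    for gpu, bandwidth in LINK_BANDWIDTH_GBPS.items():
        t = transfer_time_ms(payload_gb, bandwidth)
        share = comm_fraction(compute_ms, payload_gb, bandwidth)
        print(f"{gpu}: transfer {t:.2f} ms, communication share {share:.0%}")
```

Under these assumptions the slower link spends a larger share of each step on communication, so there is more to gain from hiding or trimming that traffic. On the faster link the same optimization has less to recover, which is why the gain might shrink on H100s and H200s.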
