Abstract Summary

This paper characterizes prefill and decode bottlenecks in LLM inference on coupled CPU-GPU systems, comparing NVIDIA H100 and GH200 behavior. The analysis isolates where orchestration overhead, memory movement, and kernel execution dominate end-to-end latency.

Research Context

This paper contributes to my research program on LLM inference performance analysis on NVIDIA H100 and GH200 hardware. It is part of broader work on efficient ML systems, hardware-software co-design, and deployment-aware computer architecture.

Tags: LLM inference, GH200, H100, performance analysis
