Peer-reviewed paper
Characterizing and Optimizing LLM Inference Workloads on CPU-GPU Coupled Architectures
Abstract Summary
This paper characterizes prefill- and decode-phase bottlenecks on coupled CPU-GPU systems, comparing the behavior of NVIDIA H100 and GH200 platforms. The work isolates where orchestration, memory movement, and kernel execution dominate end-to-end inference latency.
Research Context
This paper contributes to my research program on LLM inference and coupled CPU-GPU architectures (NVIDIA GH200, H100). It is part of broader work on efficient ML systems, hardware-software co-design, and deployment-aware computer architecture.