Peer-reviewed paper
Characterizing and Optimizing LLM Inference Workloads on CPU-GPU Coupled Architectures
Abstract Summary
This paper characterizes prefill- and decode-phase bottlenecks on coupled CPU-GPU systems, comparing the behavior of NVIDIA H100 and GH200 platforms. The work isolates where orchestration, memory movement, and kernel execution dominate end-to-end inference latency.
Research Context
This paper contributes to my research program on LLM inference and coupled CPU-GPU architectures (NVIDIA GH200, H100). It is part of broader work on efficient ML systems, hardware-software co-design, and deployment-aware computer architecture.