Current focus
LLM systems, inference bottlenecks, and accelerator-aware optimization.
PhD Candidate · Carnegie Mellon University
I am a PhD candidate in Electrical & Computer Engineering at CMU, co-advised by Prof. John Paul Shen and Prof. Shawn Blanton. My research spans LLM inference optimization on CPU-GPU coupled architectures, energy-efficient deep learning accelerators, and neuromorphic computing with Temporal Neural Networks. I received the 2023 Qualcomm Innovation Fellowship and the CIT Dean's Fellowship.
Expected graduation: December 2026. Incoming Silicon Solution Engineering Intern at NVIDIA (Mar 2026 – Jun 2026) and AI Research Scientist Intern at Samsung Semiconductor (Jun 2026 – Sep 2026); both offers accepted.
Research footprint
Collaborations across 4 research groups: CMU NCAL, CMU ACTL, UCF UNARY, and NEXUS.
Teaching & advising
10 teaching semesters and mentoring support in architecture and VLSI workflows.
My work sits at the intersection of systems, architecture, and physical implementation: from LLM inference behavior on coupled CPU-GPU platforms to low-power accelerator design and neuromorphic hardware generation.
Profiling and optimizing LLM workloads on CPU-GPU coupled architectures (H100, GH200). KV cache efficiency, batching strategies, kernel-level bottleneck decomposition.
Custom GEMM units, convolution cores, and MAC architectures targeting edge AI — leveraging unary/binary hybrid arithmetic for area-power-efficiency trade-offs.
Temporal Neural Networks (TNNs), automated RTL-to-GDSII design frameworks, and custom PDK development for neuromorphic sensory processing.
Physical design, floorplanning, clock tree synthesis, DRC/LVS signoff on TSMC N5/N7 and ASAP7 PDK. Hardware-software co-design for AI workloads.
Recent paper acceptances, invited talks, awards, and upcoming industry research appointments.
Representative papers across LLM systems, accelerator architecture, and temporal neuromorphic hardware.
TaxBreak: Unmasking the Hidden Costs of LLM Inference Through Overhead Decomposition
IEEE ISPASS 2026 · Accepted
A decomposition of LLM inference overheads that isolates non-matmul costs, exposes where end-to-end latency is lost, and helps redirect optimization effort toward the bottlenecks large-model deployments actually pay for.
Mugi: Value Level Parallelism for Efficient LLMs
ACM ASPLOS 2026 · Systems
Generalizes value-level parallelism (VLP) for nonlinear LLM operations and small-batch GEMMs. Up to 45× throughput and 668× energy efficiency for softmax; 2.07× LLM throughput and 3.11× energy efficiency; 1.45× reduction in operational carbon.
Characterizing and Optimizing LLM Inference Workloads on CPU-GPU Coupled Architectures
IEEE ISPASS 2025 · Invited Talk at Jülich Supercomputing Center
Characterizes prefill/decode bottlenecks on H100 vs GH200: GH200 incurs 2.8× higher prefill latency and 4× larger CPU-bounded region. Samsung-funded ($150K+).
Tempus Core: Area-Power Efficient Temporal-Unary Convolution Core for Low-Precision Edge DLAs
IEEE DATE 2025
INT8 temporal-unary convolution core for NVDLA on 7nm: 53% area reduction, 44% power savings, 5× iso-area throughput improvement.
Catwalk: Unary Top-K for Efficient Ramp-No-Leak Neuron Design for Temporal Neural Networks
IEEE ISVLSI 2025 · Best Paper Award
Introduces a unary top-k design for Temporal Neural Networks that improves ramp-no-leak neuron efficiency and earned the Amar Mukherjee Best Paper Award at ISVLSI 2025.