Doctoral group work

PhD Research Projects

All projects below are part of my doctoral research at CMU NCAL / CMU ACTL / UCF UNARY / NEXUS under the supervision of Prof. J.P. Shen and Prof. Shawn Blanton, in collaboration across four research groups.

Mugi — Value-Level Parallelism for Efficient LLMs

ASPLOS 2026

Co-invented Mugi, a technique that exploits value-level parallelism in transformer nonlinear operations (softmax, LayerNorm, top-K) to accelerate LLM inference. Mugi reduces memory footprint while improving throughput on memory-bound workloads.

Key results: Up to 45× throughput and 668× energy efficiency for softmax; 2.07× end-to-end LLM throughput and 3.11× energy efficiency; 1.45× reduction in operational carbon and 1.48× in embodied carbon. Outperforms existing nonlinear approximations in accuracy, performance, and efficiency.

LLM Inference · Value Parallelism · Transformers · ASPLOS 2026

Tempus Core — Temporal-Unary Convolution Core for Edge DLAs

DATE 2025

Architected an INT8 temporal-unary convolution core for NVDLA, targeting low-precision edge deep learning inference. Carried out the full physical design flow on a 7nm process: floorplanning, clock tree synthesis (CTS), place-and-route, and DRC/LVS signoff.

Key results: 53% area reduction, 44% power savings, 5× iso-area throughput improvement over NVDLA baseline.

NVDLA · 7nm · INT8 · Physical Design · CTS · DATE 2025

TNNGen — Automated Neuromorphic SPU Design Framework

ISCAS 2024 · TCAS-II 2024

Developed TNNGen, an automation framework that compiles PyTorch Temporal Neural Network (TNN) models to DRC/LVS-clean post-layout netlists. Validated across 7 application modalities (audio, EEG, gesture, etc.).

Key result: Reduces TNN hardware design time from weeks to under 2 hours. Selected for journal publication in IEEE TCAS-II 2024.

TNN · RTL Automation · PyTorch · GDSII · ISCAS 2024 · TCAS-II 2024

TNN7 — Custom 7nm PDK Extension for Neuromorphic TNNs

ISVLSI 2022

Devised TNN7, an open-source extension to the ASAP7 predictive 7nm PDK comprising 9 custom hard macros for Temporal Neural Networks. Used by 3 research groups.

Key results: 14% power, 16% delay, 28% area, and 45% EDP reductions over baseline ASAP7 designs.

ASAP7 · Custom PDK · Hard Macros · TNN · ISVLSI 2022

Internship research

Industry Research Contributions

These projects were developed through research internships and industry collaborations, with an emphasis on deployment-facing performance characterization and manufacturable low-power accelerator designs.

LLM Inference Profiling on CPU-GPU Coupled Architectures

ISPASS 2025 · Samsung-Funded

Built SKIP — a PyTorch-based profiling tool for operator-kernel dynamics in LLM inference, characterizing KV cache efficiency and GPU memory bandwidth utilization on NVIDIA H100 and GH200 Grace Hopper systems. Spearheaded a 5-person CMU–Samsung research collaboration from concept to publication.

Key results: GH200 incurs 2.8× higher prefill latency and a 4× larger CPU-bound region than PCIe-attached H100, traced to Grace CPU inefficiencies. Findings were shared with Samsung's accelerator design teams. Project funded at $150K+.

LLM Profiling · H100 / GH200 · KV Cache · PyTorch · CUDA Kernels · ISPASS 2025

tubGEMM — Temporal-Unary-Binary GEMM Unit

ISVLSI 2023 · MediaTek

Devised an ultra-low-power hybrid temporal-unary-binary GEMM unit for edge AI. Fabricated on TSMC N5 process technology during internship at MediaTek USA. Adopted for further development within MediaTek.

Key result: 40%+ power reduction versus a conventional binary GEMM at the same throughput.

GEMM · Unary Computing · TSMC N5 · Edge AI · ISVLSI 2023

OzMAC — Sparsity-Exploiting MAC Unit for DL Inference

VLSI-SoC 2024 · MediaTek

Designed OzMAC, a bit-serial, sparsity-exploiting multiply-accumulate unit. Achieved full timing closure on TSMC N5. Adopted for further development at MediaTek.

MAC · Sparsity · TSMC N5 · VLSI-SoC 2024

Talks & Presentations

Invited Talk · May 20, 2025
"Characterizing and Optimizing LLM Inference Workloads on CPU-GPU Coupled Architectures"
Jülich Supercomputing Center, Forschungszentrum Jülich, Germany (Remote)

Conference · May 12, 2025
"Characterizing and Optimizing LLM Inference Workloads on CPU-GPU Coupled Architectures"
ISPASS 2025, Ghent, Belgium

Conference · April 1, 2025
"Tempus Core: Area-Power Efficient Temporal-Unary Convolution Core for Low-Precision Edge DLAs"
DATE 2025, Lyon, France

Conference · October 7, 2024
"OzMAC: An Energy-Efficient Sparsity-Exploiting Multiply-Accumulate-Unit Design for DL Inference"
VLSI-SoC 2024, Tangier, Morocco

Conference · July 2, 2024
"Exploration of Unary Arithmetic-Based Matrix Multiply Units for Low Precision DL Accelerators"
ISVLSI 2024, Knoxville, TN

Conference · May 21, 2024
"TNNGen: Automated Design of Neuromorphic Sensory Processing Units for Time-Series Clustering"
ISCAS 2024, Singapore

Fellowships & Awards

Exemplary Performance Award

MediaTek USA Inc. — Innovative contribution during AI Architecture internship

ISVLSI 2024 Travel Grant

IEEE ISVLSI 2024

CMU GSA Conference Grant

Carnegie Mellon University Graduate Student Assembly

DAC Young Fellow

Design Automation Conference, 2022 — Top early-career EDA researchers

ASPLOS Young Architect

ASPLOS 2022

Professional Service

Peer Reviewer: IEEE Transactions on VLSI Systems (TVLSI) · IEEE Journal of Exploratory Solid-State Computational Devices and Circuits (JXCDC)

Memberships: IEEE-Eta Kappa Nu (HKN) · Sigma Xi Scientific Research Honor Society