Doctoral work
Research
LLM inference optimization, deep learning accelerators, neuromorphic computing, and unary arithmetic for efficient hardware systems.
Doctoral group work
PhD Research Projects
All projects below are part of my doctoral research at CMU, conducted under the supervision of Prof. J.P. Shen and Prof. Shawn Blanton in collaboration across four research groups: CMU NCAL, CMU ACTL, UCF UNARY, and NEXUS.
Mugi — Value-Level Parallelism for Efficient LLMs
ASPLOS 2026
Co-invented Mugi, a technique that exploits value-level parallelism in transformer nonlinear operations (softmax, LayerNorm, top-K) to accelerate LLM inference. Mugi reduces memory footprint while improving throughput on memory-bound workloads.
Key results: up to 45× throughput and 668× energy efficiency for softmax; 2.07× end-to-end LLM throughput and 3.11× energy efficiency; 1.45× reduction in operational carbon and 1.48× in embodied carbon. Outperforms existing nonlinear approximations in accuracy, performance, and efficiency.
Tempus Core — Temporal-Unary Convolution Core for Edge DLAs
DATE 2025
Architected an INT8 temporal-unary convolution core for NVDLA, targeting low-precision deep learning inference at the edge. Completed the full physical design flow on a 7nm process technology: floorplanning, clock-tree synthesis (CTS), place-and-route, and DRC/LVS signoff.
Key results: 53% area reduction, 44% power savings, and 5× iso-area throughput improvement over the NVDLA baseline.
TNNGen — Automated Neuromorphic SPU Design Framework
ISCAS 2024 · TCAS-II 2024
Developed TNNGen, an automation framework that compiles PyTorch Temporal Neural Network (TNN) models down to DRC/LVS-clean post-layout netlists. Validated across 7 application modalities (audio, EEG, gesture, etc.).
Key result: reduces TNN hardware design time from weeks to under 2 hours. Selected for journal publication in IEEE TCAS-II 2024.
TNN7 — Custom 7nm PDK Extension for Neuromorphic TNNs
ISVLSI 2022
Devised TNN7, an open-source extension of the predictive 7nm ASAP7 PDK comprising 9 custom hard macros for Temporal Neural Networks. Used by 3 research groups.
Key results: reductions of 14% in power, 16% in delay, 28% in area, and 45% in EDP over baseline ASAP7 designs.
Internship research
Industry Research Contributions
These projects were developed through research internships and industry collaborations, with an emphasis on deployment-facing performance characterization and manufacturable low-power accelerator designs.
LLM Inference Profiling on CPU-GPU Coupled Architectures
ISPASS 2025 · Samsung-Funded
Built SKIP, a PyTorch-based profiling tool for operator-kernel dynamics in LLM inference, characterizing KV-cache efficiency and GPU memory-bandwidth utilization on NVIDIA H100 and GH200 Grace Hopper systems. Spearheaded a 5-person CMU–Samsung research collaboration from concept to publication.
Key results: GH200 incurs 2.8× higher prefill latency and a 4× larger CPU-bound region than a PCIe-attached H100, due to Grace CPU inefficiencies. Findings were shared with Samsung accelerator design efforts. Project funded at $150K+.
tubGEMM — Temporal-Unary-Binary GEMM Unit
ISVLSI 2023 · MediaTek
Devised an ultra-low-power hybrid temporal-unary-binary GEMM unit for edge AI. Fabricated on TSMC N5 process technology during an internship at MediaTek USA; adopted for further development within MediaTek.
Key result: 40%+ power reduction versus a conventional binary GEMM unit at the same throughput.
OzMAC — Sparsity-Exploiting MAC Unit for DL Inference
VLSI-SoC 2024 · MediaTek
Designed OzMAC, a bit-serial, sparsity-exploiting multiply-accumulate (MAC) unit for deep learning inference. Achieved full timing closure on TSMC N5; adopted for further development at MediaTek.
Talks & Presentations
Fellowships & Awards
Amar Mukherjee Best Paper Award
ISVLSI 2025
Qualcomm Innovation Fellowship — Winner
North America, 2023
CIT Dean's Fellowship
Carnegie Mellon University doctoral fellowship
Exemplary Performance Award
MediaTek USA Inc. — Innovative contribution during AI Architecture internship
ISVLSI 2024 Travel Grant
IEEE ISVLSI 2024
CMU GSA Conference Grant
Carnegie Mellon University Graduate Student Assembly
DAC Young Fellow
Design Automation Conference, 2022 — Top early-career EDA researchers
ASPLOS Young Architect
ASPLOS 2022
Professional Service
Peer Reviewer: IEEE Transactions on VLSI Systems (TVLSI) · IEEE Journal of Exploratory Solid-State Computational Devices and Circuits (JXCDC)
Memberships: IEEE-Eta Kappa Nu (HKN) · Sigma Xi Scientific Research Honor Society