Research portfolio
Publications
A record of 12 peer-reviewed papers, 5 workshop papers, and 3 preprints. Each entry includes a short abstract-style summary so the page stays easy to scan.
Preprints
Recent papers released as preprints and currently moving toward the next review cycle.
Preprint NeuroAI Temporal Neural Networks (NeuTNNs): Microarchitecture and Design Framework for Specialized Neuromorphic Processing Units
Feb 2026 · arXiv:2602.01546
NeuTNNs define a neuromorphic processing framework built around active dendrites and hierarchical proximal/distal segments. NeuTNNGen links the model to layout generation, while synaptic pruning reduces hardware cost without sacrificing accuracy across sensory tasks.
Preprint A-Graph: A Unified Graph Representation for At-Will Simulation across System Stacks
Feb 2026 · arXiv:2602.04847
A-Graph proposes a unified graph representation for applications, software stacks, and architectures so the same model can support simulation, analysis, and cross-stack optimization.
Preprint TaxBreak: Unmasking the Hidden Costs of LLM Inference Through Overhead Decomposition
Mar 2026 · arXiv:2603.12465
TaxBreak decomposes host-visible orchestration overhead in LLM inference into framework translation, CUDA library translation, and kernel launch-path costs. The work introduces the Host-Device Balance Index to make host and device bottlenecks easier to compare and optimize.
Peer-Reviewed Papers
Papers spanning LLM systems, low-power accelerators, and neuromorphic hardware.
[1] TaxBreak: Unmasking the Hidden Costs of LLM Inference Through Overhead Decomposition
IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS) 2026 · Accepted · Scheduled for presentation on April 27, 2026 · arXiv:2603.12465
TaxBreak decomposes host-visible orchestration overhead in LLM inference into framework translation, CUDA library translation, and kernel launch-path costs. The work introduces the Host-Device Balance Index to make host and device bottlenecks easier to compare and optimize.
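The decomposition described above can be pictured as splitting end-to-end latency into named host-side overhead buckets plus device compute, then reducing the split to a single comparable number. The sketch below is illustrative only: the category values and the `balance_index` formula (device share of total time) are hypothetical stand-ins, not the paper's actual Host-Device Balance Index definition.

```python
# Hypothetical sketch: decompose inference latency into host-side
# orchestration buckets and a device-compute share, then derive a
# simple balance index. Numbers and formula are illustrative.

def balance_index(host_overheads, device_ms):
    """Return the device share of total time (1.0 = fully device-bound)."""
    host_ms = sum(host_overheads.values())
    return device_ms / (host_ms + device_ms)

overheads = {                       # host-visible costs in ms (made up)
    "framework_translation": 1.8,
    "cuda_library_translation": 0.9,
    "kernel_launch_path": 1.3,
}
idx = balance_index(overheads, device_ms=16.0)
print(f"balance index: {idx:.2f}")  # values near 1.0 suggest device-bound
```

A per-request index like this makes it easy to spot workloads where host orchestration, rather than GPU compute, dominates.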
[2] Mugi: Value Level Parallelism for Efficient LLMs
ACM International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS) 2026
Mugi uses value-level parallelism to restructure nonlinear LLM operations and small-batch GEMMs. The paper shows that the approach can raise throughput and efficiency by turning value computation into a more parallel execution pattern.
[3] Catwalk: Unary Top-K for Efficient Ramp-No-Leak Neuron Design for Temporal Neural Networks
IEEE Computer Society Annual Symposium on VLSI (ISVLSI) 2025 · Amar Mukherjee Best Paper Award
Catwalk introduces a unary top-K mechanism for temporal neural networks, targeting the ramp-no-leak neuron design. The result is a more efficient selection path that reduces overhead in the neuromorphic pipeline.
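In a temporal code, stronger activations fire earlier, so top-K selection reduces to picking the K earliest spikes. The snippet below shows only that selection semantics as a hedged software model; Catwalk's actual hardware mechanism is more involved, and the `temporal_top_k` helper is an illustration, not the paper's circuit.

```python
# Illustrative model of temporal top-K: smaller firing time means a
# stronger activation, so top-K is just "first K neurons to fire".

def temporal_top_k(firing_times, k):
    """Return indices of the k earliest-firing neurons."""
    order = sorted(range(len(firing_times)), key=lambda i: firing_times[i])
    return order[:k]

times = [7, 2, 9, 4, 1]          # spike time per neuron, in time steps
print(temporal_top_k(times, 2))  # neurons 4 and 1 fire first
```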
[4] Characterizing and Optimizing LLM Inference Workloads on CPU-GPU Coupled Architectures
IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS) 2025 · Invited Talk at Jülich Supercomputing Center
This paper characterizes prefill and decode bottlenecks across coupled CPU-GPU systems and compares NVIDIA H100 and GH200 behavior. The work isolates where orchestration, memory movement, and kernel execution dominate end-to-end latency.
[5] Tempus Core: Area-Power Efficient Temporal-Unary Convolution Core for Low-Precision Edge DLAs
IEEE Design, Automation & Test in Europe (DATE) 2025
Tempus Core presents a temporal-unary convolution engine for low-precision edge DLA workloads. The paper focuses on reducing area and power while preserving useful throughput for compact accelerator designs.
[6] OzMAC: An Energy-Efficient Sparsity-Exploiting Multiply-Accumulate-Unit Design for DL Inference
IEEE VLSI-SoC 2024
OzMAC targets sparsity-aware inference by reducing unnecessary multiply-accumulate work. The design explores how bit sparsity can translate directly into lower energy cost in deep learning accelerators.
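The general idea of translating bit sparsity into energy savings can be seen in a shift-add multiplier: only set bits of the multiplier trigger an add, so operands with low pop-count cost fewer active cycles. This is a minimal behavioral sketch of that principle, not OzMAC's actual microarchitecture.

```python
# Behavioral model of a bit-sparsity-exploiting multiply: adds a shifted
# copy of `a` only for each set bit of `b`; zero bits cost nothing.

def shift_add_mul(a, b):
    """Return (a * b, number of add cycles performed)."""
    acc, cycles, bit = 0, 0, 0
    while b >> bit:
        if (b >> bit) & 1:       # zero bits are skipped entirely
            acc += a << bit
            cycles += 1
        bit += 1
    return acc, cycles

print(shift_add_mul(13, 0b1001))  # (117, 2): two set bits -> two adds
```

The `cycles` count is a rough proxy for switching activity: sparser multipliers do proportionally less work.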
[7] Exploration of Unary Arithmetic-Based Matrix Multiply Units for Low Precision DL Accelerators
IEEE Computer Society Annual Symposium on VLSI (ISVLSI) 2024
This work explores unary arithmetic for matrix multiplication in low-precision AI hardware. It studies how temporal unary representations change the area-power tradeoffs relative to more conventional arithmetic units.
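Unary arithmetic encodes a value as a count of identical pulses, which trades numeric range for very simple logic: multiplication becomes pulse counting rather than a binary multiplier array. The sketch below uses an assumed list-of-pulses encoding purely for illustration; real temporal-unary units stream these pulses over clock cycles.

```python
# Toy model of unary (pulse-count) arithmetic: value n is n pulses,
# and a product is accumulated by counting b's pulses once per pulse
# of a. Illustrative encoding, not an actual hardware design.

def unary_encode(n):
    return [1] * n                # value n -> a train of n pulses

def unary_mul(a_pulses, b_pulses):
    count = 0
    for _ in a_pulses:            # each pulse of a gates b's full train
        count += sum(b_pulses)
    return count

print(unary_mul(unary_encode(3), unary_encode(4)))  # 12
```

The cost grows with the encoded values rather than the bit width, which is why unary designs target low-precision operands.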
[8] Realtime Person Identification via Gait Analysis using IMU Sensors on Edge Devices
International Conference on Neuromorphic Systems (ICONS) 2024
The paper demonstrates real-time person identification from gait signals measured with IMU sensors on edge devices. It highlights how lightweight sensing and edge inference can support privacy-preserving identification tasks.
[9] TNNGen: Automated Design of Neuromorphic Sensory Processing Units for Time-Series Clustering
IEEE International Symposium on Circuits and Systems (ISCAS) 2024 · Invited for TCAS-II
TNNGen automates the design of neuromorphic sensory processing units for time-series clustering. The work links algorithmic TNN ideas to generated hardware and reports a design path that reduces synapse and implementation cost.
[10] tubGEMM: Energy-Efficient and Sparsity-Effective Temporal-Unary-Binary Based Matrix Multiply Unit
IEEE Computer Society Annual Symposium on VLSI (ISVLSI) 2023
tubGEMM combines temporal, unary, and binary arithmetic to build a matrix-multiply unit that is more tolerant of sparsity and low precision. The design studies how hybrid arithmetic can improve energy efficiency without giving up too much flexibility.
[11] tuGEMM: Area-Power-Efficient Temporal Unary GEMM Architecture for Low Resolution Edge AI
IEEE International Symposium on Circuits and Systems (ISCAS) 2023
tuGEMM explores temporal unary GEMM as an implementation strategy for low-resolution edge AI. The paper emphasizes area and power reductions while preserving enough arithmetic throughput for practical inference workloads.
[12] TNN7: A Custom Macro Suite for Implementing Highly Optimized Designs of Neuromorphic TNNs
IEEE Computer Society Annual Symposium on VLSI (ISVLSI) 2022
TNN7 provides a macro suite for implementing temporal neural network designs. The paper lays groundwork for a more systematic neuromorphic design flow rather than one-off circuit implementations.
Workshop Papers
Workshop papers that helped shape the larger research program.
[W1] Mugi: Value Level Parallelism For Nonlinear Operations in LLMs
Workshop on Unary Computing (WUC), ASPLOS 2026
This workshop version extends the Mugi idea to nonlinear LLM operations and frames value-level parallelism as a broader execution model.
[W2] A-Graph: A Unified Graph Representation for At-Will Simulation of Emerging Stacks
Workshop on Unary Computing (WUC), ASPLOS 2026
The workshop paper presents the A-Graph representation as a practical simulation abstraction for emerging stacks. It emphasizes broad applicability and a cleaner route from system description to experimentation.
[W3] Exploration of Unary Based GEMM Designs for Conventional AI/DL Accelerators
2nd Workshop on Unary Computing (WUC), ASPLOS 2024
This paper surveys unary-based GEMM design points for more conventional AI accelerators. It explores how unary arithmetic may fit into mainstream inference hardware without requiring a full architectural reset.
[W4] xBrain: Brain-Like Computing for Explainable Brain-Computer Interfaces
Young Architect Workshop (YArch), ASPLOS 2024
xBrain connects brain-like computing ideas with explainable brain-computer interfaces. The workshop paper frames the work as both a systems problem and an interpretability problem.
[W5] Towards a Design Framework for TNN-Based Neuromorphic Sensory Processing Units
Young Architect Workshop (YArch), ASPLOS 2022
This workshop paper lays out a design framework for TNN-based neuromorphic sensory processing units. It is an early statement of the research direction that later papers expanded into complete hardware and tooling flows.