Research portfolio
Publications
A record of 12 peer-reviewed papers, 5 workshop papers, and 3 preprints. Each entry includes a short abstract-style summary so the page stays easy to scan.
Preprints
Recent papers released as preprints and currently moving toward the next review cycle.
Preprint NeuroAI Temporal Neural Networks (NeuTNNs): Microarchitecture and Design Framework for Specialized Neuromorphic Processing Units
Feb 2026 · arXiv:2602.01546
NeuTNNs define a neuromorphic processing framework built around active dendrites and hierarchical proximal/distal segments. NeuTNNGen links the model to layout generation, while synaptic pruning reduces hardware cost without sacrificing accuracy across sensory tasks.
Preprint A-Graph: A Unified Graph Representation for At-Will Simulation across System Stacks
Feb 2026 · arXiv:2602.04847
A-Graph proposes a unified graph representation for applications, software stacks, and architectures so the same model can support simulation, analysis, and cross-stack optimization.
Preprint TaxBreak: Unmasking the Hidden Costs of LLM Inference Through Overhead Decomposition
Mar 2026 · arXiv:2603.12465
TaxBreak decomposes host-visible orchestration overhead in LLM inference into framework translation, CUDA library translation, and kernel launch-path costs. The work introduces the Host-Device Balance Index to make host and device bottlenecks easier to compare and optimize.
Peer-Reviewed Papers
Papers spanning LLM systems, low-power accelerators, and neuromorphic hardware.
[1] TaxBreak: Unmasking the Hidden Costs of LLM Inference Through Overhead Decomposition
IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS) 2026 · Accepted · Scheduled for presentation on April 27, 2026 · arXiv:2603.12465
TaxBreak decomposes host-visible orchestration overhead in LLM inference into framework translation, CUDA library translation, and kernel launch-path costs. The work introduces the Host-Device Balance Index to make host and device bottlenecks easier to compare and optimize.
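The decomposition described above can be pictured as splitting end-to-end latency into named host-side overhead buckets plus device compute, then reducing the split to a single comparable number. The sketch below is illustrative only: the category values and the `balance_index` formula (device share of total time) are hypothetical stand-ins, not the paper's actual Host-Device Balance Index definition.

```python
# Hypothetical sketch: decompose inference latency into host-side
# orchestration buckets and a device-compute share, then derive a
# simple balance index. Numbers and formula are illustrative.

def balance_index(host_overheads, device_ms):
    """Return the device share of total time (1.0 = fully device-bound)."""
    host_ms = sum(host_overheads.values())
    return device_ms / (host_ms + device_ms)

overheads = {                       # host-visible costs in ms (made up)
    "framework_translation": 1.8,
    "cuda_library_translation": 0.9,
    "kernel_launch_path": 1.3,
}
idx = balance_index(overheads, device_ms=16.0)
print(f"balance index: {idx:.2f}")  # values near 1.0 suggest device-bound
```

A per-request index like this makes it easy to spot workloads where host orchestration, rather than GPU compute, dominates.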
[2] Mugi: Value Level Parallelism for Efficient LLMs
ACM International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS) 2026
Mugi uses value-level parallelism to restructure nonlinear LLM operations and small-batch GEMMs. The paper shows that the approach can raise throughput and efficiency by turning value computation into a more parallel execution pattern.
[3] Catwalk: Unary Top-K for Efficient Ramp-No-Leak Neuron Design for Temporal Neural Networks
IEEE Computer Society Annual Symposium on VLSI (ISVLSI) 2025 · Amar Mukherjee Best Paper Award
Catwalk introduces a unary top-K mechanism for temporal neural networks, targeting the ramp-no-leak neuron design. The result is a more efficient selection path that reduces overhead in the neuromorphic pipeline.
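In a temporal code, stronger activations fire earlier, so top-K selection reduces to picking the K earliest spikes. The snippet below shows only that selection semantics as a hedged software model; Catwalk's actual hardware mechanism is more involved, and the `temporal_top_k` helper is an illustration, not the paper's circuit.

```python
# Illustrative model of temporal top-K: smaller firing time means a
# stronger activation, so top-K is just "first K neurons to fire".

def temporal_top_k(firing_times, k):
    """Return indices of the k earliest-firing neurons."""
    order = sorted(range(len(firing_times)), key=lambda i: firing_times[i])
    return order[:k]

times = [7, 2, 9, 4, 1]          # spike time per neuron, in time steps
print(temporal_top_k(times, 2))  # neurons 4 and 1 fire first
```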
[4] Characterizing and Optimizing LLM Inference Workloads on CPU-GPU Coupled Architectures
IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS) 2025 · Invited Talk at Jülich Supercomputing Center
This paper characterizes prefill and decode bottlenecks across coupled CPU-GPU systems and compares NVIDIA H100 and GH200 behavior. The work isolates where orchestration, memory movement, and kernel execution dominate end-to-end latency.
[5] Tempus Core: Area-Power Efficient Temporal-Unary Convolution Core for Low-Precision Edge DLAs
IEEE Design, Automation & Test in Europe (DATE) 2025
Tempus Core presents a temporal-unary convolution engine for low-precision edge DLA workloads. The paper focuses on reducing area and power while preserving useful throughput for compact accelerator designs.
[6] OzMAC: An Energy-Efficient Sparsity-Exploiting Multiply-Accumulate-Unit Design for DL Inference
IEEE VLSI-SoC 2024
OzMAC targets sparsity-aware inference by reducing unnecessary multiply-accumulate work. The design explores how bit sparsity can translate directly into lower energy cost in deep learning accelerators.
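The general idea of translating bit sparsity into energy savings can be seen in a shift-add multiplier: only set bits of the multiplier trigger an add, so operands with low pop-count cost fewer active cycles. This is a minimal behavioral sketch of that principle, not OzMAC's actual microarchitecture.

```python
# Behavioral model of a bit-sparsity-exploiting multiply: adds a shifted
# copy of `a` only for each set bit of `b`; zero bits cost nothing.

def shift_add_mul(a, b):
    """Return (a * b, number of add cycles performed)."""
    acc, cycles, bit = 0, 0, 0
    while b >> bit:
        if (b >> bit) & 1:       # zero bits are skipped entirely
            acc += a << bit
            cycles += 1
        bit += 1
    return acc, cycles

print(shift_add_mul(13, 0b1001))  # (117, 2): two set bits -> two adds
```

The `cycles` count is a rough proxy for switching activity: sparser multipliers do proportionally less work.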
[7] Exploration of Unary Arithmetic-Based Matrix Multiply Units for Low Precision DL Accelerators
IEEE Computer Society Annual Symposium on VLSI (ISVLSI) 2024
This work explores unary arithmetic for matrix multiplication in low-precision AI hardware. It studies how temporal unary representations change the area-power tradeoffs relative to more conventional arithmetic units.
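Unary arithmetic encodes a value as a count of identical pulses, which trades numeric range for very simple logic: multiplication becomes pulse counting rather than a binary multiplier array. The sketch below uses an assumed list-of-pulses encoding purely for illustration; real temporal-unary units stream these pulses over clock cycles.

```python
# Toy model of unary (pulse-count) arithmetic: value n is n pulses,
# and a product is accumulated by counting b's pulses once per pulse
# of a. Illustrative encoding, not an actual hardware design.

def unary_encode(n):
    return [1] * n                # value n -> a train of n pulses

def unary_mul(a_pulses, b_pulses):
    count = 0
    for _ in a_pulses:            # each pulse of a gates b's full train
        count += sum(b_pulses)
    return count

print(unary_mul(unary_encode(3), unary_encode(4)))  # 12
```

The cost grows with the encoded values rather than the bit width, which is why unary designs target low-precision operands.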
[8] Realtime Person Identification via Gait Analysis using IMU Sensors on Edge Devices
International Conference on Neuromorphic Systems (ICONS) 2024
The paper demonstrates real-time person identification from gait signals measured with IMU sensors on edge devices. It highlights how lightweight sensing and edge inference can support privacy-preserving identification tasks.
[9] TNNGen: Automated Design of Neuromorphic Sensory Processing Units for Time-Series Clustering
IEEE International Symposium on Circuits and Systems (ISCAS) 2024 · Invited for TCAS-II
TNNGen automates the design of neuromorphic sensory processing units for time-series clustering. The work links algorithmic TNN ideas to generated hardware and reports a design path that reduces synapse and implementation cost.
[10] tubGEMM: Energy-Efficient and Sparsity-Effective Temporal-Unary-Binary Based Matrix Multiply Unit
IEEE Computer Society Annual Symposium on VLSI (ISVLSI) 2023
tubGEMM combines temporal, unary, and binary arithmetic to build a matrix-multiply unit that is more tolerant of sparsity and low precision. The design studies how hybrid arithmetic can improve energy efficiency without giving up too much flexibility.
[11] tuGEMM: Area-Power-Efficient Temporal Unary GEMM Architecture for Low Resolution Edge AI
IEEE International Symposium on Circuits and Systems (ISCAS) 2023
tuGEMM explores temporal unary GEMM as an implementation strategy for low-resolution edge AI. The paper emphasizes area and power reductions while preserving enough arithmetic throughput for practical inference workloads.
[12] TNN7: A Custom Macro Suite for Implementing Highly Optimized Designs of Neuromorphic TNNs
IEEE Computer Society Annual Symposium on VLSI (ISVLSI) 2022
TNN7 provides a macro suite for implementing temporal neural network designs. The paper lays groundwork for a more systematic neuromorphic design flow rather than one-off circuit implementations.
Workshop Papers
Workshop papers that helped shape the larger research program.
[W1] Mugi: Value Level Parallelism For Nonlinear Operations in LLMs
Workshop on Unary Computing (WUC), ASPLOS 2026
This workshop version extends the Mugi idea to nonlinear LLM operations and frames value-level parallelism as a broader execution model.
[W2] A-Graph: A Unified Graph Representation for At-Will Simulation of Emerging Stacks
Workshop on Unary Computing (WUC), ASPLOS 2026
The workshop paper presents the A-Graph representation as a practical simulation abstraction for emerging stacks. It emphasizes broad applicability and a cleaner route from system description to experimentation.
[W3] Exploration of Unary Based GEMM Designs for Conventional AI/DL Accelerators
2nd Workshop on Unary Computing (WUC), ASPLOS 2024
This paper surveys unary-based GEMM design points for more conventional AI accelerators. It explores how unary arithmetic may fit into mainstream inference hardware without requiring a full architectural reset.
[W4] xBrain: Brain-Like Computing for Explainable Brain-Computer Interfaces
Young Architect Workshop (YArch), ASPLOS 2024
xBrain connects brain-like computing ideas with explainable brain-computer interfaces. The workshop paper frames the work as both a systems problem and an interpretability problem.
[W5] Towards a Design Framework for TNN-Based Neuromorphic Sensory Processing Units
Young Architect Workshop (YArch), ASPLOS 2022
This workshop paper lays out a design framework for TNN-based neuromorphic sensory processing units. It is an early statement of the research direction that later papers expanded into complete hardware and tooling flows.