Research Highlights

EdgeReasoning: Optimizing Reasoning LLM Deployment on Edge GPUs

Benjamin Kubwimana, Qijing Jenny Huang (NVIDIA)

Edge intelligence is increasingly demanded by emerging autonomous systems such as robotics, offering privacy-preserving operation, resilience in connectivity-limited environments, and significant energy and cost advantages over cloud-based solutions. However, deploying large language models (LLMs) for reasoning tasks on edge GPUs faces critical challenges from strict latency constraints and limited computational resources. Developers must balance multiple design factors—choosing reasoning versus non-reasoning architectures, selecting appropriate model sizes, allocating token budgets, and applying test-time scaling strategies—to meet target latency and optimize accuracy.

EdgeReasoning presents a comprehensive study characterizing the deployment of reasoning LLMs on edge GPUs. We systematically quantify latency-accuracy tradeoffs across various LLM architectures and model sizes, evaluate prompt-based and model-tuning-based techniques for reducing reasoning token length while maintaining performance quality, and profile test-time scaling methods with varying degrees of parallelism to maximize accuracy under strict latency budgets. Through these analyses, EdgeReasoning maps the Pareto frontier of achievable accuracy-latency configurations, offering systematic guidance for optimal edge deployment of reasoning LLMs.
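
As a concrete illustration of how such profiling data could be used, the Python sketch below extracts the accuracy-latency Pareto frontier from a set of measured configurations and picks the most accurate one that fits a latency budget. The configuration names, latencies, and accuracies are made up for illustration; this is not code or data from the paper.

```python
# Illustrative sketch: Pareto-frontier extraction and latency-budgeted
# configuration selection. All configuration entries below are hypothetical.

from dataclasses import dataclass

@dataclass
class Config:
    name: str          # e.g. model, token budget, degree of parallel sampling
    latency_s: float   # measured end-to-end latency on the edge GPU
    accuracy: float    # task accuracy for this configuration

def pareto_frontier(configs):
    """Keep configurations that are not dominated in both latency and accuracy."""
    frontier = []
    for c in sorted(configs, key=lambda c: (c.latency_s, -c.accuracy)):
        if not frontier or c.accuracy > frontier[-1].accuracy:
            frontier.append(c)
    return frontier

def best_under_budget(configs, latency_budget_s):
    """Return the highest-accuracy configuration that meets the latency budget."""
    feasible = [c for c in configs if c.latency_s <= latency_budget_s]
    return max(feasible, key=lambda c: c.accuracy) if feasible else None

if __name__ == "__main__":
    measured = [  # hypothetical profiling results
        Config("non-reasoning-1.5B", 0.8, 0.52),
        Config("reasoning-1.5B, 1k-token budget", 2.5, 0.61),
        Config("reasoning-8B, 2k-token budget", 9.0, 0.74),
        Config("reasoning-8B, 2k-token budget, 4 parallel samples", 11.5, 0.79),
    ]
    print([c.name for c in pareto_frontier(measured)])
    print(best_under_budget(measured, latency_budget_s=10.0))
```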

Content Addressable Memory (Physical Design with 3nm TSMC Technology)

This study implements a 64-bit content-addressable memory (CAM) in a 3nm TSMC technology, with the goal of creating a design optimized for minimum energy-delay-area (EDA) product. A 10-T NOR CAM bitcell architecture was chosen for its fast bit-matching and larger noise margin. Search-line conditioning was eliminated to reduce energy consumption; in addition, the write-enable signal was used to disable the search lines (SLs) during writes. Minimum-sized transistors were used to optimize for area and energy. Implemented using Synopsys Design Compiler, the design was optimized for minimum EDA product and verified with successful post-layout simulations.
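
For readers less familiar with CAMs, the short Python sketch below is a purely behavioral model of what a 64-bit CAM search does: every stored word is compared against the search key in parallel, and the addresses of matching entries are returned. It is an illustration of the function only, not the transistor-level 10-T NOR design; the depth, stored words, and search key are made up.

```python
# Behavioral sketch of a 64-bit CAM search (functional model only, not the
# 10-T NOR bitcell circuit). Stored words and the search key are arbitrary.

WORD_BITS = 64

class BehavioralCAM:
    def __init__(self, depth):
        self.words = [None] * depth  # None models an unwritten/invalid entry

    def write(self, addr, word):
        assert 0 <= word < (1 << WORD_BITS)
        self.words[addr] = word

    def search(self, key):
        """Return all addresses whose stored word matches the 64-bit key.

        In a NOR CAM, any mismatching bit pulls the match line low, so a word
        matches only when every bit agrees; that is modeled here as equality.
        """
        return [a for a, w in enumerate(self.words) if w is not None and w == key]

if __name__ == "__main__":
    cam = BehavioralCAM(depth=8)
    cam.write(3, 0xDEADBEEF_CAFEBABE)
    cam.write(5, 0x0123_4567_89AB_CDEF)
    print(cam.search(0xDEADBEEF_CAFEBABE))  # -> [3]
```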

Physics-Informed Neural Network for Inverse Heat Conduction Estimation in Electronic Packages

The novelty of this study lies in the ability to perform thermal simulation studies based solely on microarchitectural performance and power simulations, an area that lacks support during the pre-silicon design phase. Additionally, engineers can use the provided inverse conduction modeling approach to create and analyze chip-surface heat maps while performing physical chip thermal characterization. The latter is useful for hotspot identification and for troubleshooting various cooling solutions, including interface materials, heat spreaders, and vapor chambers, without the need to invest in expensive IR cameras and complex setups for die-surface heat maps.
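
A minimal sketch of the general physics-informed approach appears below, assuming a 2D steady-state heat equation, an arbitrary thermal conductivity, and synthetic sensor readings. It illustrates the technique (a temperature network and a heat-source network trained jointly against a PDE residual and sparse sensor data), not the study's actual model, geometry, or data.

```python
# Minimal PINN sketch for inverse heat-source estimation (illustrative
# assumptions throughout): recover an unknown planar heat-source map q(x, y)
# from a few temperature readings by enforcing the steady-state heat equation
#   k * (d2T/dx2 + d2T/dy2) + q(x, y) = 0
# as a soft constraint. Conductivity k and the sensor data are synthetic.

import torch

def mlp(out_dim=1):
    return torch.nn.Sequential(
        torch.nn.Linear(2, 64), torch.nn.Tanh(),
        torch.nn.Linear(64, 64), torch.nn.Tanh(),
        torch.nn.Linear(64, out_dim))

T_net = mlp()   # temperature field T(x, y)
q_net = mlp()   # unknown heat-source map q(x, y) to be inferred
k = 1.0         # assumed thermal conductivity (arbitrary units)

# Hypothetical sensor locations and readings on the die surface.
xy_sens = torch.rand(20, 2)
T_sens = torch.sin(3.0 * xy_sens[:, :1]) * torch.cos(3.0 * xy_sens[:, 1:])

opt = torch.optim.Adam(list(T_net.parameters()) + list(q_net.parameters()), lr=1e-3)

def pde_residual(xy):
    """Residual of k * laplacian(T) + q at the given collocation points."""
    xy = xy.requires_grad_(True)
    T = T_net(xy)
    grad = torch.autograd.grad(T.sum(), xy, create_graph=True)[0]
    d2T_dx2 = torch.autograd.grad(grad[:, 0].sum(), xy, create_graph=True)[0][:, :1]
    d2T_dy2 = torch.autograd.grad(grad[:, 1].sum(), xy, create_graph=True)[0][:, 1:]
    return k * (d2T_dx2 + d2T_dy2) + q_net(xy)

for step in range(2000):
    opt.zero_grad()
    xy_col = torch.rand(256, 2)                           # collocation points
    loss_pde = pde_residual(xy_col).pow(2).mean()         # physics residual
    loss_data = (T_net(xy_sens) - T_sens).pow(2).mean()   # fit sensor readings
    (loss_pde + loss_data).backward()
    opt.step()

# After training, evaluating q_net on a grid gives an estimated heat map.
```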

Machine Learning-Assisted Building Energy Modeling (Surrogate Modeling for Physics Models)

In this research study, conducted as part of a master's thesis at Florida Tech, we focused on building a machine learning model for ML-assisted Building Energy Models (BEMs). We developed an efficient approach to design-space exploration for training the model by creating a Python-based automation program that used EnergyPlus solvers and APIs to iteratively generate various physical models within a search space.
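
The Python sketch below illustrates the general shape of such a design-space exploration and surrogate-training loop. The design variables, their ranges, and the stand-in function for the simulation are all hypothetical; in the actual workflow, that step is where the automation program would generate and run an EnergyPlus model and parse its outputs.

```python
# Illustrative sketch of design-space exploration plus surrogate training.
# Parameter names, ranges, and the simulation stand-in are hypothetical.

import random
from sklearn.ensemble import RandomForestRegressor

PARAM_RANGES = {                      # hypothetical design variables
    "wall_insulation_r": (1.0, 8.0),
    "window_u_value": (0.8, 3.0),
    "infiltration_ach": (0.1, 1.0),
}

def sample_design():
    """Draw one random point from the design space."""
    return {k: random.uniform(lo, hi) for k, (lo, hi) in PARAM_RANGES.items()}

def simulate_energy_use(design):
    """Stand-in for an EnergyPlus run. In the real workflow, the automation
    program would generate an input model for `design`, invoke the EnergyPlus
    solver, and parse annual energy use from its outputs; a toy analytic
    expression is used here so the sketch runs on its own."""
    return (150.0 / design["wall_insulation_r"]
            + 40.0 * design["window_u_value"]
            + 60.0 * design["infiltration_ach"]
            + random.gauss(0.0, 2.0))

# Design-space exploration loop: each sampled design becomes one training row.
X, y = [], []
for _ in range(300):
    d = sample_design()
    X.append([d[k] for k in PARAM_RANGES])
    y.append(simulate_energy_use(d))

# Surrogate regressor standing in for the physics model.
surrogate = RandomForestRegressor(n_estimators=200).fit(X, y)
new_design = sample_design()
print(surrogate.predict([[new_design[k] for k in PARAM_RANGES]]))
```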