TITLE:
Edge-Centric Generative AI: A Survey on Efficient Inference for Large Language Models in Resource-Constrained Environments
AUTHORS:
Rui Huang
KEYWORDS:
Large Language Models, Edge AI, Model Compression, Quantization, Neuromorphic Computing, Heterogeneous Systems
JOURNAL NAME:
Journal of Computer and Communications, Vol.14 No.4, April 30, 2026
ABSTRACT:
The deployment of Large Language Models (LLMs) on edge devices represents a paradigm shift in artificial intelligence, from dependence on the cloud to pervasive, privacy-preserving on-device intelligence. However, running multi-billion-parameter models on battery-powered devices (e.g., AR glasses, mobile agents) is severely constrained by memory bandwidth, energy efficiency, and thermal dissipation. This survey presents a comprehensive taxonomy of efficient inference techniques, categorizing recent advances in post-training quantization, structural pruning, speculative decoding, and heterogeneous system scheduling. Unlike prior reviews, we formalize the trade-offs between model fidelity and hardware constraints through the lens of the Roofline Model and thermodynamic limits. We further identify critical industrial challenges in enabling “always-on” agents and propose a roadmap for future hardware-algorithm co-design.