Kwon, W., Li, Z., Zhuang, S., Sheng, Y., Zheng, L., Yu, C., Gonzalez, J.E., et al. (2023) Efficient Memory Management for Large Language Model Serving with PagedAttention. arXiv:2309.06180.


has been cited by the following article:

TITLE: Edge-Centric Generative AI: A Survey on Efficient Inference for Large Language Models in Resource-Constrained Environments

AUTHORS: Rui Huang

KEYWORDS: Large Language Models, Edge AI, Model Compression, Quantization, Neuromorphic Computing, Heterogeneous Systems

JOURNAL NAME: Journal of Computer and Communications, Vol.14 No.4, April 30, 2026

ABSTRACT: The deployment of Large Language Models (LLMs) on edge devices represents a paradigm shift in artificial intelligence, transitioning from cloud-centric dependence to pervasive, privacy-preserving on-device intelligence. However, enabling multi-billion parameter models on battery-powered devices (e.g., AR glasses, mobile agents) faces severe bottlenecks in memory bandwidth, energy efficiency, and thermal dissipation. This survey presents a comprehensive taxonomy of efficient inference techniques, rigorously categorizing recent advancements in post-training quantization, structural pruning, speculative decoding, and heterogeneous system scheduling. Unlike prior reviews, we formalize the trade-offs between model fidelity and hardware constraints through the lens of the Roofline Model and thermodynamic limits. We further identify critical industrial challenges in enabling “always-on” agents, proposing a roadmap for future hardware-algorithm co-design.
