Optimizing LLM Inference: Challenges and Best Practices
October 24, 2024
This presentation delves into the world of Large Language Models (LLMs), focusing on the efficiency of LLM inference. We will discuss the tradeoff between latency and bandwidth, followed by a deep dive into techniques for accelerating inference, such as KV caching, quantization, speculative decoding, and various forms of parallelism. We will compare popular inference frameworks and address the challenge of navigating the multitude of design choices. Finally, we'll introduce Nvidia Inference Microservices as a convenient one-stop solution for achieving efficient inference with many popular models.