Why Inference Is Hard
Inference requires efficiently loading and quantizing the model. This video covers the depth and breadth of methods for loading and quantization, including mmap, standard quantization, GGUF, AWQ, EXL2, FP8, and NVFP4. We also get into various inference engines, like llama.cpp, vLLM, SGLang, TensorRT-LLM, and TGI, though the differences between them become more pronounced as we talk about prefill, decoding, and serving the model with concurrency and scheduling.
#inference #deeplearning #llm
Zo Computer:
https://zo.computer
Chapters
00:00 Intro
01:14 Artifacts
02:46 Load
03:30 mmap
05:52 Sponsor: Zo
06:38 Quantization
07:43 Standard
09:52 GGUF
11:51 AWQ
13:05 EXL2
14:19 FP8, NVFP4
14:42 Conclusion