topics survey summary

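inference:
deepdive, hotspot analysis, and the explanation behind benchmark results - David, Alice
benchmarking: cross-framework and cross-version comparison
adapting models to low precision (accuracy evaluation and troubleshooting accuracy drops): fp8+wo+sq+int4wo
solving multi-node, multi-GPU inference problems: tp, pp, tp=8 hang, etc. - Gu Jun, David
ifb deepdive and parameter configuration
trtllm: best practices for using the various parameters
trtllm: using new features
triton + vllm
lora
speculative decoding
low-latency generation: medusa, etc.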

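for inference sme:
Int4fp8, fp8+fp8kvcache, why faster
performance tuning best practice: trtllm, trtllm+triton
ifb
MoE inference optimization
fp8 communication
multimodal inference
long-sequence inference
new features
GPU memory usage optimization
edge devices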

candidates:

trtllm new features, performance tuning best practice
1. kvcache
2. long seq
3. attention
4. model support
5. future - new models, new features, etc.
how to add/modify a model in TRTLLM and debug
1. decoder only
low precision - AMMO + fp8
(h20/L20)?

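training:
Mcore: rebase, hotspot analysis, long sequences
nemo: lora, peft, sft; custom models; GPU memory savings; auto configurator
fp8: speed + accuracy
MoE: level of mcore support, custom development, fp8
deepspeed - alex qiu, lark zhang, zhang Shawn, zhang yuekai
nemo vs. mcore, mlm performance alignment issues; how to use new features in the config: moe, cp, etc.
flash attention performance
nvtest
SD training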

for training sme:
cp deepdive
mcore new features: overlap, cp, etc.
best practices
fp8, pretrain, performance, accuracy, sft
zero-bubble
multimodal

candidates:

fp8
nemo+mcore, moe, long-seq, multi-modal
profiling

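others:
multimodal (involving video encoding/decoding)
sd pipeline: t2i, t2v
digital human + llm, building intelligent assistants
multi-stream, cuda-graph for recsys
solving real business problems with llm
robot simulation
rag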

candidates:

GenAI