Labels

  • Something isn't working
  • Automated tests, build checks, GitHub Actions, system stability & efficiency.
  • Help/insights needed from the community
  • PRs initiated from the community
  • Specialized/modified CUDA kernels in TRTLLM for LLM ops, beyond standard TRT. Dev & perf.
  • Pull requests that update a dependency file
  • Deploying TRTLLM with separated, distributed components (params, kv-cache, compute). Arch & perf.
  • TRTLLM's textual/illustrative materials: API refs, guides, tutorials. Improvement & clarity.
  • This issue or pull request already exists
  • Improvements to, or complaints about, TRTLLM ease of use
  • New feature or request. This includes new model, dtype, or functionality support
  • General operational aspects of TRTLLM execution not in other categories.
  • Extra attention is needed
  • Setting up and building TRTLLM: compilation, pip install, dependencies, env config, CMake.
  • kv-cache management for efficient LLM inference
  • High-level LLM Python API & tools (e.g., trtllm-llmapi-launch) for TRTLLM inference/workflows.
  • Parameter-Efficient Fine-Tuning (PEFT) like LoRA/P-tuning in TRTLLM: adapter use & perf.
  • Lower-precision formats (INT8/INT4/FP8) for TRTLLM quantization (AWQ, GPTQ).
  • Memory utilization in TRTLLM: leak/OOM handling, footprint optimization, memory profiling.
  • Further info is required from the requester before devs can help
  • Request to add a new model
  • A known limitation, but not a bug.
  • trtllm-serve's OpenAI-compatible API: endpoint behavior, req/resp formats, feature parity.
  • TRTLLM model inference speed, throughput, efficiency. Latency, benchmarks, regressions, opts.