Is your feature request related to a problem? Please describe.
The codebase does not take full advantage of the supported log levels (info, debug, error, panic, etc.). Some conditions are logged at error level even though they are not actually errors. For example, when a ModelService (MSVC) is deleted, the controller logs the following on a cluster:
{"level":"error","ts":"2025-05-13T19:47:31.10556661Z","caller":"controller/modelservice_controller.go:343","msg":"unable to get prefill deployment","controller":"modelservice","controllerGroup":"llm-d.ai","controllerKind":"ModelService","ModelService":{"name":"facebook-opt-125m-nixl","namespace":"e2e-solution"},"namespace":"e2e-solution","name":"facebook-opt-125m-nixl","reconcileID":"7ee671da-985d-4a34-a4a6-a82fee86a006","error":"Deployment.apps \"facebook-opt-125m-nixl-decode\" not found","stacktrace":"github.com/neuralmagic/llm-d-model-service/internal/controller.(*ModelServiceReconciler).populateStatus\n\t/workspace/internal/controller/modelservice_controller.go:343\ngithub.com/neuralmagic/llm-d-model-service/internal/controller.(*ModelServiceReconciler).Reconcile\n\t/workspace/internal/controller/modelservice_controller.go:223\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller[...]).Reconcile\n\t/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:1...
{"level":"error","ts":"2025-05-13T19:47:31.105594651Z","caller":"controller/modelservice_controller.go:377","msg":"unable to get Epp deployment","controller":"modelservice","controllerGroup":"llm-d.ai","controllerKind":"ModelService","ModelService":{"name":"facebook-opt-125m-nixl","namespace":"e2e-solution"},"namespace":"e2e-solution","name":"facebook-opt-125m-nixl","reconcileID":"7ee671da-985d-4a34-a4a6-a82fee86a006","error":"Deployment.apps \"facebook-opt-125m-nixl-epp\" not found","stacktrace":"github.com/neuralmagic/llm-d-model-service/internal/controller.(*ModelServiceReconciler).populateStatus\n\t/workspace/internal/controller/modelservice_controller.go:377\ngithub.com/neuralmagic/llm-d-model-service/internal/controller.(*ModelServiceReconciler).Reconcile\n\t/workspace/internal/controller/modelservice_controller.go:223\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller[...]).Reconcile\n\t/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:119\nsi...
{"level":"info","ts":"2025-05-13T19:47:31.105705591Z","caller":"controller/modelservice_controller.go:410","msg":"ModelService no longer exists, skipping status update","controller":"modelservice","controllerGroup":"llm-d.ai","controllerKind":"ModelService","ModelService":{"name":"facebook-opt-125m-nixl","namespace":"e2e-solution"},"namespace":"e2e-solution","name":"facebook-opt-125m-nixl","reconcileID":"7ee671da-985d-4a34-a4a6-a82fee86a006"}
This may be a race condition: the MSVC is deleted while the controller is still updating its status. The controller then cannot locate the MSVC's child Deployments, which produces the error logs above.
Describe the solution approach you'd like
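Make better use of the logging system's levels. For example, an expected condition such as a child Deployment not being found while the MSVC is being deleted could be logged at info or debug level instead of error.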
Related to #128