feat: added new prometheus metrics for long conn #1305
Conversation
```go
@@ -90,6 +90,34 @@ var (
		"connection_security_policy",
	}

	connectionLabels = []string{
```
@yp969803 we deliberately chose not to introduce src/dst addresses into metrics. In k8s, pods are scaled up/down and even migrated frequently, so these labels are unbounded and could easily lead to a memory leak. Similar issues have been reported in Istio; you can search there.
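To make the cardinality concern concrete, here is a minimal sketch (not the PR's code) of how a per-pod address label makes the number of distinct time series grow with pod churn, while bounded labels like app/version keep it constant. The `countSeries` function and label values are illustrative.

```go
package main

import "fmt"

// countSeries simulates how many distinct time series a metric accumulates
// after n pod connections. Each distinct label-value combination is one
// series that the Prometheus client keeps in memory indefinitely.
func countSeries(n int, withAddressLabel bool) int {
	series := map[string]struct{}{}
	for i := 0; i < n; i++ {
		key := "reviews|v1" // bounded labels: app, version
		if withAddressLabel {
			// hypothetical pod IP: every scaled-up or migrated pod
			// brings a new address, minting a new series
			key += fmt.Sprintf("|10.0.%d.%d", i/256, i%256)
		}
		series[key] = struct{}{}
	}
	return len(series)
}

func main() {
	fmt.Println("bounded labels:", countSeries(1000, false))     // 1 series
	fmt.Println("with source_address:", countSeries(1000, true)) // 1000 series
}
```

This is why label sets are usually restricted to values with a small, fixed domain.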
pkg/controller/telemetry/utils.go
Outdated
```diff
 		Help: "The total number of TCP connections closed to a service",
 	}, serviceLabels)

 	tcpReceivedBytesInService = prometheus.NewGaugeVec(
 		prometheus.GaugeOpts{
-			Name: "kmesh_tcp_received_bytes_total",
+			Name: "kmesh_tcp_service_received_bytes_total",
```
Why did you change this name? It is required by both the Kmesh+Kiali and Kmesh+Grafana integrations; renaming it is not recommended.
```go
	"source_app",
	"source_version",
	"source_cluster",
	"source_address",
```
Is it only `source_address` and `destination_address` that differ between connectionLabels and workloadLabels? And what is the difference between `destination_address` and `destination_workload_address`?
Signed-off-by: Yash Patel <[email protected]>
feat: added connectionLabels for long_conn metric
Signed-off-by: Yash Patel <[email protected]>
feat: updateConnectionMetricCache and buildConnectionMetric func
Signed-off-by: Yash Patel <[email protected]>
Signed-off-by: Yash Patel <[email protected]>
Signed-off-by: Yash Patel <[email protected]>
Signed-off-by: Yash Patel <[email protected]>
Signed-off-by: Yash Patel <[email protected]>
rfac: changed the content type of delConn string
Signed-off-by: Yash Patel <[email protected]>
@hzxuzhonghu @LiZhenCheng9527 can you review the PR when you get time?
Signed-off-by: Yash Patel <[email protected]>
chore: run make gen
Signed-off-by: Yash Patel <[email protected]>
rfac: updateConnMetricCache
Signed-off-by: Yash Patel <[email protected]>
When would you remove the long connection metric? As far as I can tell, once a connection is closed it is meaningless to keep its metric around.
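The lifecycle the reviewer is asking about can be sketched as follows: on a TCP_CLOSED event, the connection's entry is dropped from the cache so its series stops being exported. This is a simplified illustration, not the PR's actual code; the names `updateConnCache` and `connLabels` are made up for the example.

```go
package main

import "fmt"

const (
	tcpEstablished = iota
	tcpClosed
)

type connLabels struct{ src, dst string }

// updateConnCache records the latest byte count for a live connection and
// evicts the entry once the connection reports closed. In the real
// controller the eviction would also call GaugeVec.Delete so the series
// disappears from the /metrics output instead of lingering forever.
func updateConnCache(cache map[connLabels]int64, l connLabels, sentBytes int64, state int) {
	if state == tcpClosed {
		delete(cache, l)
		return
	}
	cache[l] = sentBytes
}

func main() {
	cache := map[connLabels]int64{}
	l := connLabels{src: "10.0.0.1", dst: "10.0.0.2"}
	updateConnCache(cache, l, 4096, tcpEstablished)
	fmt.Println(len(cache)) // 1
	updateConnCache(cache, l, 8192, tcpClosed)
	fmt.Println(len(cache)) // 0
}
```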
pkg/controller/telemetry/metric.go
Outdated
```go
	}
	if data.state == TCP_CLOSTED {
		deleteLock.Lock()
		*delConn = append(*delConn, &labels)
```
Would prefer to make delConn an output value.
Signed-off-by: Yash Patel <[email protected]>
Signed-off-by: Yash Patel <[email protected]>
Signed-off-by: Yash Patel <[email protected]>
I have written unit tests for the conn metrics, so the PR is ready for merge. I will write e2e tests in another PR; the e2e tests also depend on how Ravjot is approaching the e2e work. @LiZhenCheng9527 @hzxuzhonghu
Force-pushed from 2ca6789 to d1b06a7 (compare)
ctl/monitoring/monitoring.go
Outdated
```go
var info string
if connectionMetricsInfo == "enable" {
	info = "true"
} else if connectionMetricsInfo == "disable" {
	info = "false"
} else {
	log.Errorf("Error: Argument must be 'enable' or 'disable'")
	os.Exit(1)
}

fw, err := utils.CreateKmeshPortForwarder(cli, podName)
if err != nil {
	log.Errorf("failed to create port forwarder for Kmesh daemon pod %s: %v", podName, err)
	os.Exit(1)
}
if err := fw.Start(); err != nil {
	log.Errorf("failed to start port forwarder for Kmesh daemon pod %s: %v", podName, err)
	os.Exit(1)
}
defer fw.Close()

url := fmt.Sprintf("http://%s%s?enable=%s", fw.Address(), patternConnectionMetrics, info)

req, err := http.NewRequest(http.MethodPost, url, nil)
if err != nil {
	log.Errorf("Error creating request: %v", err)
	return
}

req.Header.Set("Content-Type", "application/json")
client := &http.Client{}
resp, err := client.Do(req)
if err != nil {
	log.Errorf("failed to make HTTP request: %v", err)
	return
}
defer resp.Body.Close()
```
Nit: we can abstract this and reuse it. AFAIK, all the enable/disable subcommands should share this logic.
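One way to do the abstraction the reviewer suggests is to factor out the argument validation and URL construction that every enable/disable subcommand repeats. The sketch below is illustrative only; `setObservabilityFlag` and its parameters are hypothetical names, not identifiers from the PR.

```go
package main

import (
	"fmt"
	"os"
)

// setObservabilityFlag centralizes the plumbing shared by the accesslog,
// monitoring, and connection-metrics subcommands: validate the user's
// enable/disable argument, then build the daemon URL to POST to.
func setObservabilityFlag(address, pattern, arg string) (string, error) {
	var info string
	switch arg {
	case "enable":
		info = "true"
	case "disable":
		info = "false"
	default:
		return "", fmt.Errorf("argument must be 'enable' or 'disable', got %q", arg)
	}
	return fmt.Sprintf("http://%s%s?enable=%s", address, pattern, info), nil
}

func main() {
	// Each subcommand would only differ in the pattern it passes in.
	url, err := setObservabilityFlag("127.0.0.1:15200", "/connection_metrics", "enable")
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Println(url) // http://127.0.0.1:15200/connection_metrics?enable=true
}
```

The port-forwarding and HTTP request code could then live in one shared helper that takes the returned URL.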
ctl/monitoring/monitoring.go
Outdated
```go
	return
}
bodyString := string(bodyBytes)
if resp.StatusCode == http.StatusBadRequest && bytes.Contains(bodyBytes, []byte("Kmesh monitoring is disable, cannot enable accesslog")) {
```
Not accesslog here; the matched message should refer to connection metrics.
```md
Prometheus metrics exposed

- kmesh_tcp_connection_sent_bytes_total : The total number of bytes sent over established TCP connection
```
`kmesh_tcp_connection_sent_bytes_total`

or `kmesh_tcp_connection_sent_bytes_total`
pkg/controller/telemetry/metric.go
Outdated
```go
	}
	if data.state == TCP_CLOSTED {
		deleteLock.Lock()
		delConn = append(delConn, &labels)
```
We do not need the lock; you can declare a temp variable and return it.
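The lock-free shape the reviewer is asking for looks roughly like this: instead of appending to a shared `delConn` slice under a mutex, collect the closed connections' labels into a local slice and return it, so the caller owns the result and no synchronization is needed. This is a sketch with made-up types (`conn`, `closedConnLabels`), not the PR's code.

```go
package main

import "fmt"

type conn struct {
	labels string
	closed bool
}

// closedConnLabels gathers the labels of closed connections into a slice
// that is local to this call. Because nothing else can see the slice while
// it is being built, no mutex is required; the caller receives ownership.
func closedConnLabels(conns []conn) []string {
	var deleted []string // temp variable instead of a shared *delConn
	for _, c := range conns {
		if c.closed {
			deleted = append(deleted, c.labels)
		}
	}
	return deleted
}

func main() {
	conns := []conn{{"a", false}, {"b", true}, {"c", true}}
	fmt.Println(closedConnLabels(conns)) // [b c]
}
```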
Generally LGTM, we need to handle the nits and add some tests
Pull Request Overview
This PR introduces new Prometheus metrics to monitor long-lived TCP connections (duration > 30 seconds) while updating both backend controllers and client interfaces. Key changes include:
- Adding a new HTTP endpoint (/connection_metrics) in the status server and corresponding handler logic.
- Introducing connection metric triggers and cache updates in the workload controller and metric controller.
- Updating tests, telemetry utilities, and documentation to support the new connection metrics.
Reviewed Changes
Copilot reviewed 9 out of 9 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| pkg/status/status_server.go | Added endpoint and handler for connection metrics and updated monitoring triggers. |
| pkg/controller/workload/workload_controller.go | Introduced new setter for connection metric trigger. |
| pkg/controller/telemetry/utils_test.go | Updated tests to cover new connection metric gauges and deletion logic. |
| pkg/controller/telemetry/utils.go | Registered new connection metric gauges and added helper to delete connection metrics. |
| pkg/controller/telemetry/metric.go | Extended metric controller to handle and update connection metric caches. |
| docs/proposal/tcp_long_connection_metrics.md | Updated design proposal to document long connection metrics. |
| docs/ctl/kmeshctl_monitoring.md, ctl/monitoring/monitoring.go | Added CLI flags and instructions for enabling/disabling connection metrics. |
```go
_, _ = w.Write([]byte(fmt.Sprintf("invalid accesslog enable=%s", info)))
return
```

The error message currently refers to 'accesslog' instead of connection metrics. Consider updating it to something like 'invalid connection metrics enable=%s' for clarity.

Suggested change:

```go
_, _ = w.Write([]byte(fmt.Sprintf("invalid connection metrics enable=%s", info)))
return
```
> We will expose metrics for the connections whose duration exceesds 30 seconds. Not exposing metrics for short connection as it can lead to lot of metrics and they are also not suitable for prometheus metrics because prometheus itself has a scrape interval of maximum 15s, and short-lived connections may start and end between scrapes, resulting in incomplete or misleading data. By focusing only on longer-lived connections, we ensure the metrics are stable, meaningful, and better aligned with Prometheus’s time-series data model.

Typo detected: 'exceesds' should be corrected to 'exceeds'.
Signed-off-by: Yash Patel <[email protected]>
… in monitoring.go Signed-off-by: Yash Patel <[email protected]>
… status_Server_test.go Signed-off-by: Yash Patel <[email protected]>
@hzxuzhonghu done the changes, added unit tests where possible.
Thanks
```go
bodyBytes, readErr := io.ReadAll(resp.Body)
if readErr != nil {
	log.Errorf("Error reading response body: %v", readErr)
if observablityType == MONITORING {
```
Remove this condition? We should always log the error.
/lgtm
@yp969803 I would merge first and then you can continue fixing the nits.
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: hzxuzhonghu
What type of PR is this?
/kind feature
What this PR does / why we need it:
New Prometheus metrics for long TCP connections (duration > 30s)
Which issue(s) this PR fixes:
Fixes #1294
Special notes for your reviewer:
Does this PR introduce a user-facing change?:
Yes