
feat: added new prometheus metrics for long conn #1305


Merged: 14 commits, Apr 23, 2025

Conversation

yp969803 (Contributor):

What type of PR is this?
/kind feature

What this PR does / why we need it:
New Prometheus metrics for long TCP connections (duration > 30s)

Which issue(s) this PR fixes:
Fixes #1294

Special notes for your reviewer:

Does this PR introduce a user-facing change?:
Yes

New prometheus metrics for long conn

@@ -90,6 +90,34 @@ var (
"connection_security_policy",
}

connectionLabels = []string{
Member:

@yp969803 we deliberately chose not to introduce src/dst addresses into metrics, because in k8s pods are scaling up/down and even migrating frequently; these labels are unbounded and could easily lead to a memory leak. Similar issues have been reported in Istio; you can search there.
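The cardinality concern above can be illustrated with a small sketch (illustrative names only, not the PR's actual code): a metric vector keyed on pod addresses holds one time series per distinct label set, so address labels grow without bound as pods churn, while service-level labels stay fixed.

```go
package main

import "fmt"

// seriesCount returns the number of distinct label sets, i.e. the number
// of time series a Prometheus metric vector would end up holding.
func seriesCount(labelSets [][2]string) int {
	seen := map[[2]string]bool{}
	for _, ls := range labelSets {
		seen[ls] = true
	}
	return len(seen)
}

func main() {
	// Unbounded labels: every pod churn event mints a new source address.
	var perAddress [][2]string
	for i := 0; i < 1000; i++ {
		perAddress = append(perAddress, [2]string{fmt.Sprintf("10.0.0.%d", i%256), "svc-a"})
	}
	// Bounded labels: service name and namespace form a fixed, small set.
	perService := [][2]string{{"svc-a", "default"}, {"svc-b", "default"}}

	fmt.Println(seriesCount(perAddress)) // 256 here; unbounded as the cluster churns
	fmt.Println(seriesCount(perService)) // 2, regardless of pod churn
}
```

Since Prometheus keeps every series it has ever scraped in memory until it goes stale, the per-address variant is the memory-leak pattern the reviewer warns about.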

Help: "The total number of TCP connections closed to a service",
}, serviceLabels)

tcpReceivedBytesInService = prometheus.NewGaugeVec(
prometheus.GaugeOpts{
Name: "kmesh_tcp_received_bytes_total",
Name: "kmesh_tcp_service_received_bytes_total",
Contributor:

Why did you change this name?
This name is required for both kmesh+Kiali and kmesh+Grafana. It is not recommended to change it.

"source_app",
"source_version",
"source_cluster",
"source_address",
Contributor:

Are source_address and destination_address the only labels that differ between connectionLabels and workloadLabels? What is the difference between destination_address and `destination_workload_address`?

Signed-off-by: Yash Patel <[email protected]>

feat: added connectionLabels for long_conn metric

Signed-off-by: Yash Patel <[email protected]>

feat: updateConnectionMetricCache and buildConnectionMetric func

Signed-off-by: Yash Patel <[email protected]>
@kmesh-bot (Collaborator):

Adding label do-not-merge/contains-merge-commits because PR contains merge commits, which are not allowed in this repository.
Use git rebase to reapply your commits on top of the target branch. Detailed instructions for doing so can be found here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.


codecov bot commented Apr 18, 2025

Codecov Report

Attention: Patch coverage is 61.20219% with 71 lines in your changes missing coverage. Please review.

Project coverage is 45.87%. Comparing base (f7ab0e1) to head (6c6e3c4).
Report is 29 commits behind head on main.

Files with missing lines (patch % | lines):
- pkg/controller/telemetry/metric.go — 63.57%, 50 missing and 5 partials ⚠️
- pkg/status/status_server.go — 50.00%, 8 missing and 2 partials ⚠️
- pkg/controller/workload/workload_controller.go — 0.00%, 6 missing ⚠️

Files with missing lines (coverage Δ):
- pkg/controller/telemetry/utils.go — 69.33% <100.00%> (+2.66%) ⬆️
- pkg/controller/workload/workload_controller.go — 39.25% <0.00%> (-2.34%) ⬇️
- pkg/status/status_server.go — 33.00% <50.00%> (+3.26%) ⬆️
- pkg/controller/telemetry/metric.go — 52.39% <63.57%> (+5.75%) ⬆️

... and 3 files with indirect coverage changes


Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update ae81023...6c6e3c4.

Signed-off-by: Yash Patel <[email protected]>

rfac: changed the content type of delConn string

Signed-off-by: Yash Patel <[email protected]>
@yp969803 (Contributor, Author):

@hzxuzhonghu @LiZhenCheng9527 can you review the PR when you get time?
Implemented prometheus metrics for long_conns (duration > 30s)

Signed-off-by: Yash Patel <[email protected]>

chore: run make gen

Signed-off-by: Yash Patel <[email protected]>

rfac: updateConnMetricCache

Signed-off-by: Yash Patel <[email protected]>
@hzxuzhonghu (Member) left a comment:

When would you remove the long connection metric? As far as I can tell, once the connection is closed it is meaningless to keep the metric there.

}
if data.state == TCP_CLOSTED {
deleteLock.Lock()
*delConn = append(*delConn, &labels)
Member:
Would prefer to make delConn an output value.

@yp969803 (Contributor, Author):

I have written unit tests for the connection metrics, so the PR is ready for merge. I will write e2e tests in another PR; those also depend on how Ravjot is approaching the e2e tests. @LiZhenCheng9527 @hzxuzhonghu

@yp969803 force-pushed the issue1294 branch 2 times, most recently from 2ca6789 to d1b06a7 on April 21, 2025 13:43
Comment on lines 304 to 340
var info string
if connectionMetricsInfo == "enable" {
info = "true"
} else if connectionMetricsInfo == "disable" {
info = "false"
} else {
log.Errorf("Error: Argument must be 'enable' or 'disable'")
os.Exit(1)
}

fw, err := utils.CreateKmeshPortForwarder(cli, podName)
if err != nil {
log.Errorf("failed to create port forwarder for Kmesh daemon pod %s: %v", podName, err)
os.Exit(1)
}
if err := fw.Start(); err != nil {
log.Errorf("failed to start port forwarder for Kmesh daemon pod %s: %v", podName, err)
os.Exit(1)
}
defer fw.Close()

url := fmt.Sprintf("http://%s%s?enable=%s", fw.Address(), patternConnectionMetrics, info)

req, err := http.NewRequest(http.MethodPost, url, nil)
if err != nil {
log.Errorf("Error creating request: %v", err)
return
}

req.Header.Set("Content-Type", "application/json")
client := &http.Client{}
resp, err := client.Do(req)
if err != nil {
log.Errorf("failed to make HTTP request: %v", err)
return
}
defer resp.Body.Close()
Member:
Nit: we can abstract this and reuse it. AFAIK, all the enable/disable handling should share this.

return
}
bodyString := string(bodyBytes)
if resp.StatusCode == http.StatusBadRequest && bytes.Contains(bodyBytes, []byte("Kmesh monitoring is disable, cannot enable accesslog")) {
Member:
not accesslog here


Prometheus metrics exposed

- kmesh_tcp_connection_sent_bytes_total : The total number of bytes sent over established TCP connection
Member:
kmesh_tcp_connection_sent_bytes_total

Member:

or kmesh_tcp_connection_sent_bytes_total

}
if data.state == TCP_CLOSTED {
deleteLock.Lock()
delConn = append(delConn, &labels)
Member:
We do not need the lock; you can declare a temp variable and return it.
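The reviewer's suggestion can be sketched as follows (hypothetical types and names, not the PR's actual code): collect closed connections into a local slice and return it, so no shared slice, output pointer, or mutex is involved.

```go
package main

import "fmt"

// connData is a hypothetical stand-in for the PR's per-connection record.
type connData struct {
	state  string
	labels string
}

// closedConnLabels collects the labels of closed connections into a local
// slice and returns it; because the slice is local, no lock is needed.
func closedConnLabels(conns []connData) []string {
	var closed []string
	for _, c := range conns {
		if c.state == "TCP_CLOSED" {
			closed = append(closed, c.labels)
		}
	}
	return closed
}

func main() {
	conns := []connData{
		{state: "TCP_ESTABLISHED", labels: "a"},
		{state: "TCP_CLOSED", labels: "b"},
		{state: "TCP_CLOSED", labels: "c"},
	}
	fmt.Println(closedConnLabels(conns)) // [b c]
}
```

The caller can then delete the returned label sets from the metric vectors in one place, which also answers the earlier question about when the long-connection metric is removed.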

@hzxuzhonghu (Member) left a comment:

Generally LGTM, we need to handle the nits and add some tests

@hzxuzhonghu hzxuzhonghu requested a review from Copilot April 22, 2025 03:25
@Copilot (Copilot AI) left a comment:

Pull Request Overview

This PR introduces new Prometheus metrics to monitor long-lived TCP connections (duration > 30 seconds) while updating both backend controllers and client interfaces. Key changes include:

  • Adding a new HTTP endpoint (/connection_metrics) in the status server and corresponding handler logic.
  • Introducing connection metric triggers and cache updates in the workload controller and metric controller.
  • Updating tests, telemetry utilities, and documentation to support the new connection metrics.

Reviewed Changes

Copilot reviewed 9 out of 9 changed files in this pull request and generated 2 comments.

Summary per file:
- pkg/status/status_server.go — Added endpoint and handler for connection metrics and updated monitoring triggers.
- pkg/controller/workload/workload_controller.go — Introduced new setter for connection metric trigger.
- pkg/controller/telemetry/utils_test.go — Updated tests to cover new connection metric gauges and deletion logic.
- pkg/controller/telemetry/utils.go — Registered new connection metric gauges and added helper to delete connection metrics.
- pkg/controller/telemetry/metric.go — Extended metric controller to handle and update connection metric caches.
- docs/proposal/tcp_long_connection_metrics.md — Updated design proposal to document long connection metrics.
- docs/ctl/kmeshctl_monitoring.md, ctl/monitoring/monitoring.go — Added CLI flags and instructions for enabling/disabling connection metrics.

Comment on lines +307 to +308
_, _ = w.Write([]byte(fmt.Sprintf("invalid accesslog enable=%s", info)))
return
Copilot AI, Apr 22, 2025:
The error message currently refers to 'accesslog' instead of connection metrics. Consider updating it to something like 'invalid connection metrics enable=%s' for clarity.

Suggested change:
_, _ = w.Write([]byte(fmt.Sprintf("invalid connection metrics enable=%s", info)))


Comment on lines 174 to 175
We will expose metrics for the connections whose duration exceesds 30 seconds. Not exposing metrics for short connection as it can lead to lot of metrics and they are also not suitable for prometheus metrics because prometheus itself has a scrape interval of maximum 15s, and short-lived connections may start and end between scrapes, resulting in incomplete or misleading data. By focusing only on longer-lived connections, we ensure the metrics are stable, meaningful, and better aligned with Prometheus’s time-series data model.

Copilot AI, Apr 22, 2025:

Typo detected: 'exceesds' should be corrected to 'exceeds'.

@yp969803 (Contributor, Author):

@hzxuzhonghu done the changes, added unit tests where possible.

@hzxuzhonghu (Member):

Thanks

bodyBytes, readErr := io.ReadAll(resp.Body)
if readErr != nil {
log.Errorf("Error reading response body: %v", readErr)
if observablityType == MONITORING {
Member:
Remove this check? Always log the error.

@hzxuzhonghu (Member) left a comment:

/lgtm

@yp969803 I would merge first and then you can continue fixing the nits.

@kmesh-bot (Collaborator):

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: hzxuzhonghu

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@kmesh-bot kmesh-bot merged commit e579883 into kmesh-net:main Apr 23, 2025
11 checks passed
Successfully merging this pull request may close these issues.

New prometheus-metric for long conn
4 participants