
feat: added new prometheus metrics for long conn #1305


Merged: 14 commits, Apr 23, 2025

Conversation

yp969803 (Contributor):

What type of PR is this?
/kind feature

What this PR does / why we need it:
New Prometheus metrics for long TCP connections (duration > 30s)

Which issue(s) this PR fixes:
Fixes #1294

Special notes for your reviewer:

Does this PR introduce a user-facing change?:
Yes

New prometheus metrics for long conn

@@ -90,6 +90,34 @@ var (
"connection_security_policy",
}

connectionLabels = []string{
Member:

@yp969803 we deliberately chose not to introduce src/dst addresses into metrics, because in k8s pods are scaling up/down and even migrating frequently; these labels are unbounded and could easily lead to a memory leak. Similar issues have been reported in Istio; you can search there.
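The cardinality concern above can be illustrated with a small sketch (illustrative names only, not the PR's actual code): a metric vector keyed on pod addresses holds one time series per distinct label set, so address labels grow without bound as pods churn, while service-level labels stay fixed.

```go
package main

import "fmt"

// seriesCount returns the number of distinct label sets, i.e. the number
// of time series a Prometheus metric vector would end up holding.
func seriesCount(labelSets [][2]string) int {
	seen := map[[2]string]bool{}
	for _, ls := range labelSets {
		seen[ls] = true
	}
	return len(seen)
}

func main() {
	// Unbounded labels: every pod churn event mints a new source address.
	var perAddress [][2]string
	for i := 0; i < 1000; i++ {
		perAddress = append(perAddress, [2]string{fmt.Sprintf("10.0.0.%d", i%256), "svc-a"})
	}
	// Bounded labels: service name and namespace form a fixed, small set.
	perService := [][2]string{{"svc-a", "default"}, {"svc-b", "default"}}

	fmt.Println(seriesCount(perAddress)) // 256 here; unbounded as the cluster churns
	fmt.Println(seriesCount(perService)) // 2, regardless of pod churn
}
```

Since Prometheus keeps every series it has ever scraped in memory until it goes stale, the per-address variant is the memory-leak pattern the reviewer warns about.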

Help: "The total number of TCP connections closed to a service",
}, serviceLabels)

tcpReceivedBytesInService = prometheus.NewGaugeVec(
prometheus.GaugeOpts{
Name: "kmesh_tcp_received_bytes_total",
Name: "kmesh_tcp_service_received_bytes_total",
Contributor:

Why did you change this name?
This name is required for both kmesh+Kiali and kmesh+Grafana. It is not recommended to change it.

"source_app",
"source_version",
"source_cluster",
"source_address",
Contributor:

Are source_address and destination_address the only labels that differ between connectionLabels and workloadLabels? What is the difference between destination_address and `destination_workload_address`?

Signed-off-by: Yash Patel <[email protected]>

feat: added connectionLabels for long_conn metric

Signed-off-by: Yash Patel <[email protected]>

feat: updateConnectionMetricCache and buildConnectionMetric func

Signed-off-by: Yash Patel <[email protected]>
@kmesh-bot (Collaborator):

Adding label do-not-merge/contains-merge-commits because PR contains merge commits, which are not allowed in this repository.
Use git rebase to reapply your commits on top of the target branch. Detailed instructions for doing so can be found here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.


codecov bot commented Apr 18, 2025

Codecov Report

Attention: Patch coverage is 61.20219% with 71 lines in your changes missing coverage. Please review.

Project coverage is 45.87%. Comparing base (f7ab0e1) to head (6c6e3c4).
Report is 29 commits behind head on main.

Files with missing lines (patch % | lines):
- pkg/controller/telemetry/metric.go — 63.57%, 50 missing and 5 partials ⚠️
- pkg/status/status_server.go — 50.00%, 8 missing and 2 partials ⚠️
- pkg/controller/workload/workload_controller.go — 0.00%, 6 missing ⚠️

Files with missing lines (coverage Δ):
- pkg/controller/telemetry/utils.go — 69.33% <100.00%> (+2.66%) ⬆️
- pkg/controller/workload/workload_controller.go — 39.25% <0.00%> (-2.34%) ⬇️
- pkg/status/status_server.go — 33.00% <50.00%> (+3.26%) ⬆️
- pkg/controller/telemetry/metric.go — 52.39% <63.57%> (+5.75%) ⬆️

... and 3 files with indirect coverage changes


Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update ae81023...6c6e3c4.

Signed-off-by: Yash Patel <[email protected]>

rfac: changed the content type of delConn string

Signed-off-by: Yash Patel <[email protected]>
@yp969803 (Contributor, Author):

@hzxuzhonghu @LiZhenCheng9527 can you review the PR when you get time?
Implemented prometheus metrics for long_conns (duration > 30s)

Signed-off-by: Yash Patel <[email protected]>

chore: run make gen

Signed-off-by: Yash Patel <[email protected]>

rfac: updateConnMetricCache

Signed-off-by: Yash Patel <[email protected]>
@hzxuzhonghu (Member) left a comment:

When would you remove the long connection metric? As far as I can tell, once the connection is closed it is meaningless to keep the metric there.

}
if data.state == TCP_CLOSTED {
deleteLock.Lock()
*delConn = append(*delConn, &labels)
Member:
Would prefer to make delConn an output value.

@yp969803 (Contributor, Author):

I have written unit tests for the connection metrics, so the PR is ready for merge. I will write e2e tests in another PR; those also depend on how Ravjot is approaching the e2e tests. @LiZhenCheng9527 @hzxuzhonghu

@yp969803 force-pushed the issue1294 branch 2 times, most recently from 2ca6789 to d1b06a7 on April 21, 2025 13:43
Comment on lines 304 to 340
var info string
if connectionMetricsInfo == "enable" {
info = "true"
} else if connectionMetricsInfo == "disable" {
info = "false"
} else {
log.Errorf("Error: Argument must be 'enable' or 'disable'")
os.Exit(1)
}

fw, err := utils.CreateKmeshPortForwarder(cli, podName)
if err != nil {
log.Errorf("failed to create port forwarder for Kmesh daemon pod %s: %v", podName, err)
os.Exit(1)
}
if err := fw.Start(); err != nil {
log.Errorf("failed to start port forwarder for Kmesh daemon pod %s: %v", podName, err)
os.Exit(1)
}
defer fw.Close()

url := fmt.Sprintf("http://%s%s?enable=%s", fw.Address(), patternConnectionMetrics, info)

req, err := http.NewRequest(http.MethodPost, url, nil)
if err != nil {
log.Errorf("Error creating request: %v", err)
return
}

req.Header.Set("Content-Type", "application/json")
client := &http.Client{}
resp, err := client.Do(req)
if err != nil {
log.Errorf("failed to make HTTP request: %v", err)
return
}
defer resp.Body.Close()
Member:
Nit: we can abstract this and reuse it. AFAIK, all the enable/disable handling should share this.

return
}
bodyString := string(bodyBytes)
if resp.StatusCode == http.StatusBadRequest && bytes.Contains(bodyBytes, []byte("Kmesh monitoring is disable, cannot enable accesslog")) {
Member:
not accesslog here


Prometheus metrics exposed

- kmesh_tcp_connection_sent_bytes_total : The total number of bytes sent over established TCP connection
Member:
kmesh_tcp_connection_sent_bytes_total

Member:

or kmesh_tcp_connection_sent_bytes_total

}
if data.state == TCP_CLOSTED {
deleteLock.Lock()
delConn = append(delConn, &labels)
Member:
We do not need the lock; you can declare a temp variable and return it.
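The reviewer's suggestion can be sketched as follows (hypothetical types and names, not the PR's actual code): collect closed connections into a local slice and return it, so no shared slice, output pointer, or mutex is involved.

```go
package main

import "fmt"

// connData is a hypothetical stand-in for the PR's per-connection record.
type connData struct {
	state  string
	labels string
}

// closedConnLabels collects the labels of closed connections into a local
// slice and returns it; because the slice is local, no lock is needed.
func closedConnLabels(conns []connData) []string {
	var closed []string
	for _, c := range conns {
		if c.state == "TCP_CLOSED" {
			closed = append(closed, c.labels)
		}
	}
	return closed
}

func main() {
	conns := []connData{
		{state: "TCP_ESTABLISHED", labels: "a"},
		{state: "TCP_CLOSED", labels: "b"},
		{state: "TCP_CLOSED", labels: "c"},
	}
	fmt.Println(closedConnLabels(conns)) // [b c]
}
```

The caller can then delete the returned label sets from the metric vectors in one place, which also answers the earlier question about when the long-connection metric is removed.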

@hzxuzhonghu (Member) left a comment:

Generally LGTM, we need to handle the nits and add some tests

@hzxuzhonghu hzxuzhonghu requested a review from Copilot April 22, 2025 03:25
@Copilot (Copilot AI) left a comment:

Pull Request Overview

This PR introduces new Prometheus metrics to monitor long-lived TCP connections (duration > 30 seconds) while updating both backend controllers and client interfaces. Key changes include:

  • Adding a new HTTP endpoint (/connection_metrics) in the status server and corresponding handler logic.
  • Introducing connection metric triggers and cache updates in the workload controller and metric controller.
  • Updating tests, telemetry utilities, and documentation to support the new connection metrics.

Reviewed Changes

Copilot reviewed 9 out of 9 changed files in this pull request and generated 2 comments.

Summary per file:
- pkg/status/status_server.go — Added endpoint and handler for connection metrics and updated monitoring triggers.
- pkg/controller/workload/workload_controller.go — Introduced new setter for connection metric trigger.
- pkg/controller/telemetry/utils_test.go — Updated tests to cover new connection metric gauges and deletion logic.
- pkg/controller/telemetry/utils.go — Registered new connection metric gauges and added helper to delete connection metrics.
- pkg/controller/telemetry/metric.go — Extended metric controller to handle and update connection metric caches.
- docs/proposal/tcp_long_connection_metrics.md — Updated design proposal to document long connection metrics.
- docs/ctl/kmeshctl_monitoring.md, ctl/monitoring/monitoring.go — Added CLI flags and instructions for enabling/disabling connection metrics.

Comment on lines +307 to +308
_, _ = w.Write([]byte(fmt.Sprintf("invalid accesslog enable=%s", info)))
return
Copilot AI, Apr 22, 2025:
The error message currently refers to 'accesslog' instead of connection metrics. Consider updating it to something like 'invalid connection metrics enable=%s' for clarity.

Suggested change:
_, _ = w.Write([]byte(fmt.Sprintf("invalid connection metrics enable=%s", info)))


Comment on lines 174 to 175
We will expose metrics for the connections whose duration exceesds 30 seconds. Not exposing metrics for short connection as it can lead to lot of metrics and they are also not suitable for prometheus metrics because prometheus itself has a scrape interval of maximum 15s, and short-lived connections may start and end between scrapes, resulting in incomplete or misleading data. By focusing only on longer-lived connections, we ensure the metrics are stable, meaningful, and better aligned with Prometheus’s time-series data model.

Copilot AI, Apr 22, 2025:

Typo detected: 'exceesds' should be corrected to 'exceeds'.

@yp969803 (Contributor, Author):

@hzxuzhonghu done the changes, added unit tests where possible.

@hzxuzhonghu (Member):

Thanks

bodyBytes, readErr := io.ReadAll(resp.Body)
if readErr != nil {
log.Errorf("Error reading response body: %v", readErr)
if observablityType == MONITORING {
Member:
Remove this check? Always log the error.

@hzxuzhonghu (Member) left a comment:

/lgtm

@yp969803 I would merge first and then you can continue fixing the nits.

@kmesh-bot (Collaborator):

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: hzxuzhonghu

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@kmesh-bot kmesh-bot merged commit e579883 into kmesh-net:main Apr 23, 2025
11 checks passed
Successfully merging this pull request may close these issues.

New prometheus-metric for long conn
4 participants