feat: replace events library with shdict #12353

Open · Revolyssup wants to merge 11 commits into master

Conversation

Revolyssup (Contributor, Author):

Breakout from: #12263
Fixes #
Currently nacos discovery uses the events library, and during testing it was found that when using lua-resty-events, sometimes not all workers receive the events, which leads to inconsistencies. Moreover, the idiomatic way to share data between workers is through a shared dict. This PR therefore migrates nacos from the older events mechanism to the newer shared-dict approach: now only the privileged agent fetches data from nacos and writes it to the shdict, while all workers read from it.
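To illustrate the new flow, here is a minimal sketch of the pattern (the dict name nacos_services and the helper fetch_from_nacos are illustrative, not the actual identifiers in this PR):

-- Sketch only: assumes a shared dict declared in the nginx config,
-- e.g. lua_shared_dict nacos_services 10m
local core    = require("apisix.core")
local process = require("ngx.process")

local nacos_dict = ngx.shared["nacos_services"]

-- writer: only the privileged agent talks to nacos and serializes
-- each service's node list into the shdict
local function fetch_full_registry()
    if process.type() ~= "privileged agent" then
        return
    end
    local services = fetch_from_nacos()  -- hypothetical fetch helper
    for name, nodes in pairs(services) do
        nacos_dict:set(name, core.json.encode(nodes))
    end
end

-- reader: every worker resolves a service with a plain shdict get,
-- so no cross-worker event delivery is needed
local function nodes(service_name)
    local value = nacos_dict:get(service_name)
    if not value then
        return nil
    end
    return core.json.decode(value)
end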

Checklist

  • I have explained the need for this PR and the problem it solves
  • I have explained the changes or the new features added to this PR
  • I have added tests corresponding to this change
  • I have updated the documentation to reflect this change
  • I have verified that this change is backward compatible (If not, please discuss on the APISIX mailing list first)

@dosubot (bot) added labels size:L (This PR changes 100-499 lines, ignoring generated files) and enhancement (New feature or request) on Jun 19, 2025
@@ -417,7 +359,20 @@ end


function _M.dump_data()
    return {config = local_conf.discovery.nacos, services = applications or {}}
Revolyssup (Contributor, Author):
config is removed from here as it's a security risk.
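For context, a sketch of what the updated dump_data might look like after dropping config, reading the services back out of the shared dict (an assumption for illustration, not the exact diff):

function _M.dump_data()
    -- get_keys(0) returns every key currently stored in the shdict
    local keys = nacos_dict:get_keys(0)
    local services = {}
    for _, key in ipairs(keys) do
        local value = nacos_dict:get(key)
        services[key] = value and core.json.decode(value) or {}
    end
    return {services = services}
end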

Member:
Makes sense.

local http = require("resty.http")
local uri = "http://127.0.0.1:" .. ngx.var.server_port

-- Wait for 2 seconds for APISIX initialization
Revolyssup (Contributor, Author):
The test is modified to add this sleep because the retry mechanism is removed: it may take some time for the privileged agent to fill the shdict, so we want to send the request only after that.
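For instance, the waiting part of the test body could look like this (a sketch; the actual test code may differ):

ngx.sleep(2)  -- give the privileged agent time to write into the shdict

local httpc = http.new()
local res, err = httpc:request_uri(uri .. "/hello")
if not res then
    ngx.say(err)
    return
end
ngx.say(res.body)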

local value = nacos_dict:get(key)
if value then
    local nodes = core.json.decode(value)
    if nodes and #nodes > 0 then
Member:
dump_data is for debugging purposes; if the value in the shared dict for this key is empty, we should return the real empty value for debugging instead of hiding it.
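In other words, the decode step should pass an empty list through rather than skipping it; roughly (a sketch of the reviewer's point, not the final code):

local value = nacos_dict:get(key)
if value then
    local nodes = core.json.decode(value)
    -- drop the "#nodes > 0" guard so empty node lists stay visible
    services[key] = nodes
end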

@nic-6443 (Member) left a comment:

One thing you missed: remove unused keys from the shared dict in every fetch_full_registry run.

Comment on lines 335 to 342
local old_curr_service_in_use = curr_service_in_use
curr_service_in_use = service_names
-- remove services that are not in use anymore
for key, _ in pairs(old_curr_service_in_use) do
    if not service_names[key] then
        nacos_dict:delete(key)
    end
end
Member:
Suggested change
- local old_curr_service_in_use = curr_service_in_use
- curr_service_in_use = service_names
- -- remove services that are not in use anymore
- for key, _ in pairs(old_curr_service_in_use) do
-     if not service_names[key] then
-         nacos_dict:delete(key)
-     end
- end
+ -- remove services that are not in use anymore
+ for key, _ in pairs(curr_service_in_use) do
+     if not service_names[key] then
+         nacos_dict:delete(key)
+     end
+ end
+ curr_service_in_use = service_names

@Revolyssup requested a review from membphis June 20, 2025 14:29
@Revolyssup (Contributor, Author) commented Jun 24, 2025:

@bzp2010 I ran a benchmark to test whether shdict:get introduces too much performance degradation. Based on the results below, I don't think it should be an issue.

Configuration used

config.yaml

apisix:
  node_listen: 1984
deployment:
  role: data_plane
  role_data_plane:
    config_provider: yaml
nginx_config:
  worker_processes: 8
discovery:
  nacos:
    host:
      - "http://nacos:[email protected]:8848"
    prefix: "/nacos/v1/"
    fetch_interval: 1
    weight: 1
    timeout:
      connect: 2000
      send: 2000
      read: 5000

apisix.yaml

routes:
  -
    uri: /hello
    upstream:
      service_name: APISIX-NACOS
      discovery_type: nacos
      type: roundrobin
#END

Here are the results.

On master

❯ wrk -c 10 -t 5 -d 30s -R 50000 http://localhost:1984/hello
Running 30s test @ http://localhost:1984/hello
  5 threads and 10 connections
  Thread calibration: mean lat.: 2999.418ms, rate sampling interval: 10395ms
  Thread calibration: mean lat.: 2885.546ms, rate sampling interval: 10125ms
  Thread calibration: mean lat.: 2857.330ms, rate sampling interval: 10059ms
  Thread calibration: mean lat.: 2967.000ms, rate sampling interval: 10231ms
  Thread calibration: mean lat.: 2953.790ms, rate sampling interval: 10256ms
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency    11.12s     3.06s   16.47s    57.49%
    Req/Sec     4.52k    82.05     4.63k    60.00%
  689405 requests in 30.00s, 110.45MB read
Requests/sec:  22980.47
Transfer/sec:      3.68MB

After changes

❯ wrk -c 10 -t 5 -d 30s -R 50000 http://localhost:1984/hello
Running 30s test @ http://localhost:1984/hello
  5 threads and 10 connections
  Thread calibration: mean lat.: 3060.053ms, rate sampling interval: 10584ms
  Thread calibration: mean lat.: 3084.401ms, rate sampling interval: 10772ms
  Thread calibration: mean lat.: 3254.041ms, rate sampling interval: 11198ms
  Thread calibration: mean lat.: 3093.539ms, rate sampling interval: 10985ms
  Thread calibration: mean lat.: 3062.424ms, rate sampling interval: 10772ms
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency    11.89s     3.36s   17.74s    56.68%
    Req/Sec     4.13k    39.11     4.18k    60.00%
  623009 requests in 30.00s, 99.81MB read
Requests/sec:  20767.18
Transfer/sec:      3.33MB

Result

The QPS dropped from 22980.47 to 20767.18, a difference of roughly 2,200 requests/sec (~9.6%). I think this is acceptable given the events-library issue this PR solves; it's an okay tradeoff.
