
feat: add support for multiple nacos instances and replace events library with shdict #12263


Draft
wants to merge 50 commits into
base: master

Conversation

@Revolyssup Revolyssup (Contributor) commented May 28, 2025

Description

  • Currently, nacos discovery uses the events library (lua-resty-events), and during testing we found that sometimes not all workers receive the events, which leads to inconsistencies. Moreover, the idiomatic way to share data between workers is through a shared dict. This PR therefore migrates nacos from the older events mechanism to a shared dict: only the privileged agent fetches data from nacos and writes it to the shdict, while all workers read from it.
  • Currently, nacos discovery supports only a single registry, but users have a use case for multiple nacos registries. This PR adds support for specifying discovery.nacos as an array of nacos instances, each with a unique id.

NOTE: The older way of configuring nacos is still supported, and older data remains valid. The old configuration is now converted internally to the new format before being passed to the discovery logic, making this change backwards compatible: users can use either the single-instance or the multi-instance configuration.
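As an illustration, a minimal sketch of the internal conversion described above: a legacy single-instance nacos config is wrapped into the new array-of-instances form. Field names other than `id` are assumptions for illustration, not the PR's actual code.

```lua
-- Hypothetical sketch of the backwards-compatibility conversion:
-- wrap a legacy single-instance config into the new array form.
local function normalize_nacos_conf(conf)
    if conf[1] then
        -- already the new array-of-instances form
        return conf
    end
    -- legacy single-instance table: give it a default id and wrap it
    conf.id = conf.id or "default"
    return { conf }
end

local legacy = { host = { "http://127.0.0.1:8848" } }
local normalized = normalize_nacos_conf(legacy)
print(normalized[1].id, #normalized)  -- default	1
```

A normalization step like this lets all downstream logic assume the multi-instance shape, keeping the single-instance syntax working unchanged.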

Fixes #11252

Checklist

  • I have explained the need for this PR and the problem it solves
  • I have explained the changes or the new features added to this PR
  • I have added tests corresponding to this change
  • I have updated the documentation to reflect this change
  • I have verified that this change is backward compatible (If not, please discuss on the APISIX mailing list first)

@dosubot dosubot bot added the size:XL This PR changes 500-999 lines, ignoring generated files. label May 28, 2025
@Revolyssup Revolyssup marked this pull request as draft May 28, 2025 08:23
@dosubot dosubot bot added the enhancement New feature or request label May 28, 2025
@Revolyssup Revolyssup marked this pull request as ready for review May 28, 2025 19:28
@dosubot dosubot bot added size:XXL This PR changes 1000+ lines, ignoring generated files. and removed size:XL This PR changes 500-999 lines, ignoring generated files. labels May 28, 2025
@bzp2010 bzp2010 (Contributor) left a comment

There are quite a few issues. I have already checked the hot paths in the code (excluding the utils section). Please take a look, and feel free to push back on any of the problems I've pointed out.

ns_id,
group_name,
service_name)
local value = nacos_dict:get_stale(key)
Contributor

Why is it necessary to retrieve potentially stale data using get_stale instead of get?

Contributor Author

This is to retrieve the old data when nacos is down.

Member

Note that the value of an expired key is not guaranteed to be available so one should never rely on the availability of expired items.

Trying to use get_stale to retrieve expired items is unreliable.

Comment on lines 52 to 63
local waiting_time = 5
local step = 0.1
local logged = false
while not value and waiting_time > 0 do
    if not logged then
        logged = true
    end

    ngx.sleep(step)
    waiting_time = waiting_time - step
    value = nacos_dict:get_stale(key)
end
Contributor

What does this code mean?

Currently, nodes have been reduced to a mechanism that only reads from the shdict. Before the code I selected, you do not attempt to initiate a request, so the nodes would not suddenly be filled in.
I understand that filling the shdict is done by a decoupled privileged process, but I don't see the value of using a spin-lock-like mechanism to wait here.

In my opinion, it should fail fast, which would cause APISIX to return a 503 error and print a log.

No matter how I look at it, it's strange that the request is blocked for 5 seconds because the service discovery provider did not return nodes.
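For concreteness, a minimal Lua sketch of the fail-fast alternative the reviewer describes: if the shared dict has no entry for the key, return nil plus an error immediately so APISIX can answer 503, instead of spin-waiting. `dict` stands in for the real `ngx.shared` zone; all names here are illustrative, not the PR's code.

```lua
-- Fail-fast lookup: no retry loop, no sleeping in the request path.
local function get_nodes(dict, key)
    local value = dict:get_stale(key)
    if not value then
        return nil, "no nodes found in shared dict for key: " .. key
    end
    return value
end

-- stub standing in for the real shared dict, for demonstration only
local stub = { store = { ["/ns/group/svc"] = '[{"host":"10.0.0.1","port":80}]' } }
function stub:get_stale(key) return self.store[key] end

local nodes = get_nodes(stub, "/ns/group/svc")   -- hit: returns the stored value
local missing, err = get_nodes(stub, "/absent")  -- miss: nil + error message
```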

Contributor Author

dict:get_stale is called before starting the wait. If no value is found, it waits for "step" seconds and tries again, for up to 5 seconds. This waiting mechanism was already there before my PR; the only thing I have done here is replace get with get_stale, for the reasons mentioned in the comment above.

Contributor

It really depends on what you think the purpose of this PR is.

As I mentioned above, this logic is weird, so my suggestion is to remove it. Let this logic fail fast instead of waiting for something.

If the upstream on a route is not loaded or does not exist at all, APISIX will obviously return 503 immediately.

You can request opinions from other reviewers here.

Contributor Author

@nic-6443 What do you think?

Member

Agree to remove it; this is weird.

function _M.dump_data()
-    return {config = local_conf.discovery.nacos, services = applications or {}}
+    return {
+        config = local_conf.discovery.nacos,
Contributor

It seems that printing configurations is not secure.

If users are responsible for maintaining APISIX deployments, they can view these in the configuration files. The Control API itself should also be used by maintenance personnel, but we know that security controls on APIs are always difficult to configure correctly.

Contributor Author

I don't think there is sensitive data in this configuration.

@bzp2010 bzp2010 (Contributor) Jun 12, 2025

This will contain access_key and secret_key.

It does seem that this should not be presented through the API in plaintext. You can't ensure that the API will only allow localhost calls, which would give security consistent with local files.
The fact that this API (the Control API) can be configured to allow access from any IP, and also does not support TLS, creates the possibility of a credential being transmitted over the internet in plaintext.

Moreover, what is the point of exposing this configuration through the API? An authorized system administrator can access the configuration file at any time, or drive configuration changes through a CI/CD system. The convenience of providing it via the API seems trivial.

Contributor Author

Removed the configuration.

Comment on lines +263 to +266
start = start,
stop = stop,
fetch_instances = fetch_instances,
fetch_full_registry = fetch_full_registry,
Contributor

According to APISIX convention, you should use a metatable and `__index`; I don't remember us having a precedent for this style.
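For reference, a minimal sketch of the metatable/`__index` module pattern being referred to; names here are illustrative, not the PR's actual functions.

```lua
-- Module table doubles as the method table via the metatable's __index.
local _M = {}
local mt = { __index = _M }

-- Constructor: methods are resolved through __index rather than
-- being copied into every instance table.
function _M.new(conf)
    return setmetatable({ conf = conf }, mt)
end

function _M.fetch_instances(self)
    return self.conf.hosts
end

local client = _M.new({ hosts = { "127.0.0.1:8848" } })
local hosts = client:fetch_instances()
print(hosts[1])  -- 127.0.0.1:8848
```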

Comment on lines 102 to 103
["username"] = username,
["password"] = password,
Contributor

Suggested change
["username"] = username,
["password"] = password,
username = username,
password = password,


method = method,
headers = headers,
body = body,
ssl_verify = false,
Contributor

It seems unsafe.

Contributor Author

I think it's okay to allow non-TLS traffic for service discovery. The Eureka service discovery code also has the same logic.

Contributor

What I mean is that hardcoding the disabling of TLS certificate validation should not be done. If it needs to default to false, then it should be configurable; otherwise, it should be hardcoded to true.

Anyway, "no certificate validation at all times" isn't acceptable security-wise.

I think we should do the right thing instead of pleasing the lazy. Not to mention that we already have ca-certificates installed for container deployments, and APISIX will use them by default.
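A sketch of the configurable alternative being suggested; the `ssl_verify` config field and the surrounding names are assumptions for illustration, not an actual APISIX option.

```lua
-- Build HTTP request params with certificate validation on by default;
-- users must opt out explicitly in the discovery configuration.
local function request_params(conf, method, headers, body)
    return {
        method = method,
        headers = headers,
        body = body,
        ssl_verify = conf.ssl_verify ~= false,  -- verify unless explicitly disabled
    }
end

local default_params = request_params({}, "GET", {}, nil)
local opt_out_params = request_params({ ssl_verify = false }, "GET", {}, nil)
print(default_params.ssl_verify, opt_out_params.ssl_verify)  -- true	false
```

The `~= false` comparison keeps the secure default when the field is absent, while still honoring an explicit opt-out.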

Contributor Author

Okay, I hardcoded it to true.

if not curr_services_in_use[config.id] then
curr_services_in_use[config.id] = {}
end
for name in pairs(service_names) do
Member

Why not assign service_names to curr_services_in_use[config.id] directly?

        for name in pairs(curr_services_in_use[config.id]) do
            if not service_names[name] then
                nacos_dict:delete(name)
            end
        end
        curr_services_in_use[config.id] = service_names

Comment on lines +45 to +56
-- maximum waiting time: 5 seconds
local waiting_time = 5
local step = 0.1
local logged = false
while not value and waiting_time > 0 do
    if not logged then
        logged = true
    end

    ngx.sleep(step)
    waiting_time = waiting_time - step
    value = nacos_dict:get(key)
end
Member

@Revolyssup why are there no changes for this? #12263 (comment); we already discussed removing it.

Comment on lines +117 to +120
-- Now we use control plane to list the services
function _M.list_all_services()
return {}
end
@nic-6443 nic-6443 (Member) Jun 19, 2025

What's this? It's an unused function; please remove it.

@Revolyssup Revolyssup marked this pull request as draft June 22, 2025 07:52
Labels
enhancement New feature or request size:XXL This PR changes 1000+ lines, ignoring generated files.
Projects
Status: 👀 In review
Development

Successfully merging this pull request may close these issues.

bug: nacos node failure should not corrupt in-memory data
4 participants