-
Notifications
You must be signed in to change notification settings - Fork 2.6k
feat: add support for multiple nacos instances and replace events library with shdict #12263
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There are quite a few issues. I have already checked the hot paths in the code (excluding the utils section). You can take a look and refute the errors I have pointed out.
apisix/discovery/nacos/init.lua
Outdated
ns_id, | ||
group_name, | ||
service_name) | ||
local value = nacos_dict:get_stale(key) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why is it necessary to retrieve potentially stale data using get_stale
instead of get
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is to retrieve the old data when nacos is down.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Note that the value of an expired key is not guaranteed to be available so one should never rely on the availability of expired items.
try to use get_stale
to retrieve expired item is unstable way.
apisix/discovery/nacos/init.lua
Outdated
local waiting_time = 5 | ||
local step = 0.1 | ||
local logged = false | ||
while not value and waiting_time > 0 do | ||
if not logged then | ||
logged = true | ||
end | ||
|
||
if body and 'table' == type(body) then | ||
local err | ||
body, err = core.json.encode(body) | ||
if not body then | ||
return nil, 'invalid body : ' .. err | ||
ngx.sleep(step) | ||
waiting_time = waiting_time - step | ||
value = nacos_dict:get_stale(key) | ||
end |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What does this code mean?
Currently, nodes have been reduced to a mechanism that only reads from shdict. Before the code I selected, you did not attempt to initiate a request, so nodes would not suddenly be filled.
I understand that filling shdict is done by a decoupled privileged process, but I don't understand the value of using a spin lock-like mechanism for waiting here.
In my opinion, it should fast fail, which would cause APISIX to return a 503 error and print the log.
No matter how I look at it, it's strange that the request was blocked for 5 seconds because the service discovery provider did not return nodes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
dict:get_stale is called before starting the wait. And if no value is found then it waits for "step" and tries again. It does that for 5 seconds. This waiting mechanism was already there before my PR. The only thing I have done here is replace get with get_stale for reasons mentioned in above comment.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It really depends on how you think the purpose of this PR is.
As I mentioned above, this logic is weird, so my suggestion is to remove it. Let this logic failfast instead of waiting for something.
If the upstream on a route is not loaded or does not exist at all, APISIX will obviously return 503 immediately.
You can request opinions from other reviewers here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@nic-6443 What do you think?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Aggre to remove it , this is weird.
apisix/discovery/nacos/init.lua
Outdated
function _M.dump_data() | ||
return {config = local_conf.discovery.nacos, services = applications or {}} | ||
return { | ||
config = local_conf.discovery.nacos, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It seems that printing configurations are not secure.
If users are responsible for maintaining APISIX deployments, they can view these in the configuration files. The Control API itself should also be used by maintenance personnel, but we know that security controls on APIs are always difficult to configure correctly.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think so there is sensitive data in this configuration.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This will contain access_key
and secret_key
.
It does seem that it should not be presented to the API in plaintext. You can't ensure that the API will only allow localhost calls for security consistent with local files.
The fact that this API (Control API) can be configured to allow access from any IP, and also does not (unsupported) turn on TLS, amounts to creating the possibility for a credential to be transmitted over the internet in plaintext.
Moreover, what is the point of passing this configuration through the API data, as an authorized system administrator who can access the configuration file at any time, or drive configuration changes through some CI/CD system. The significance of providing this convenience via the API seems trivial.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
removed the configuration
start = start, | ||
stop = stop, | ||
fetch_instances = fetch_instances, | ||
fetch_full_registry = fetch_full_registry, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
According to APISIX convention, you should use the meta table and _index
. I don't remember us having such a precedent.
apisix/discovery/nacos/factory.lua
Outdated
["username"] = username, | ||
["password"] = password, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
["username"] = username, | |
["password"] = password, | |
username = username, | |
password = password, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
apisix/discovery/nacos/factory.lua
Outdated
method = method, | ||
headers = headers, | ||
body = body, | ||
ssl_verify = false, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It seems unsafe.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think its okay to allow for non tls traffic for service discovery. Eureka service discovery code also has the same logic.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What I mean is that hardcoding to disable TLS certificate validation should not be done. If it needs to default to false
, then it should be configurable; otherwise, it should be hard-coded to true
.
Anyway, "no certificate validation at all times" isn't acceptable security-wise.
I think we should do the right thing instead of pleasing the lazy. Not to mention that we already have ca-certificates
installed for container deployment and APISIX will use it by default.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
okay I hardcoded it to true
if not curr_services_in_use[config.id] then | ||
curr_services_in_use[config.id] = {} | ||
end | ||
for name in pairs(service_names) do |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why not assign service_names
to curr_services_in_use[config.id]
directly?
for name in pairs(curr_services_in_use[config.id]) do
if not service_names[name] then
nacos_dict:delete(name)
end
end
curr_services_in_use[config.id] = service_names
-- maximum waiting time: 5 seconds | ||
local waiting_time = 5 | ||
local step = 0.1 | ||
local logged = false | ||
while not value and waiting_time > 0 do | ||
if not logged then | ||
logged = true | ||
end | ||
|
||
if body and 'table' == type(body) then | ||
local err | ||
body, err = core.json.encode(body) | ||
if not body then | ||
return nil, 'invalid body : ' .. err | ||
ngx.sleep(step) | ||
waiting_time = waiting_time - step | ||
value = nacos_dict:get(key) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@Revolyssup why no any changes for this? #12263 (comment), we already discuss to remove it?
-- Now we use control plane to list the services | ||
function _M.list_all_services() | ||
return {} | ||
end |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what's this ? it's unused function, please remove it.
Description
discovery.nacos
as an array of nacos instances each with unique ID.NOTE: The older way of configuring nacos is still supported and older data will be valid. The older configuration now gets converted internally to new configuration before being passed onto the logic therefore making this change backwards compatible. Meaning users can use the single-instance configuration as well as multi instance configuration.
Fixes #11252
Checklist