
Missing DB Connection Resiliency #549


Open
dlinch opened this issue Apr 14, 2025 · 3 comments

Comments

dlinch commented Apr 14, 2025

Last week, the Postgres instance provisioned for our production app was suddenly dropped. Heroku reported it as a hardware failure on their end, and for ~10 minutes the database backing Solid Cache and Solid Queue was unreachable.

I went looking for what I thought was already an available option: a default fallback to perform_now when a job fails to enqueue. The only things I could find were the block you can pass to any perform_later, which lets you check whether that job was successfully enqueued, or figuring out which exceptions would be raised in that scenario and adding a custom rescue around every place we call perform_later. To me, it makes sense to have some kind of top-level configuration to opt into a fallback strategy of performing jobs inline. I would rather have slower request times for a short while as our hardware recovers than throw exceptions and require manual intervention to retry jobs from Mission Control, or add the same block to every single perform_later throughout our application for such an edge case.
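For reference, the per-call pattern I'm referring to looks roughly like this (MyJob and its argument are placeholders, and it assumes the enqueue failure surfaces through successfully_enqueued? rather than raising):

```ruby
# Per-call workaround: perform_later accepts a block (Rails 7+) that yields
# the job instance after the enqueue attempt. MyJob and `record` are
# hypothetical placeholders here.
MyJob.perform_later(record) do |job|
  unless job.successfully_enqueued?
    # Fall back to running the job inline when the enqueue failed,
    # e.g. because the queue database was unreachable.
    job.perform_now
  end
end
```

Repeating that block around every perform_later call in the app is exactly what I'd like to avoid.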

I read the section where EnqueueError is raised, but that is a pretty standard class name + error message string, and it wouldn't give me the information I'd need to rerun the job. Plus, it only seems to be raised from a rescue of ActiveRecord::ActiveRecordError, which I'm not sure would cover an actual DB connection error. So I don't think a top-level rescue_from block is a reasonable option either.

Is there an appetite to handle this kind of scenario? My thought is a new top-level, opt-in-only configuration option for the app, plus potentially updating the set options to accept a per-call override. I get that this is an edge case, but this kind of thing is bound to happen to production applications sooner or later.
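Roughly what I'm picturing (both the config flag and the set option below are hypothetical; nothing like them exists today):

```ruby
# Hypothetical, opt-in configuration; this option does not exist today.
# config/environments/production.rb
config.solid_queue.on_enqueue_failure = :perform_inline # default stays :raise

# Hypothetical per-call override for jobs that should never run in the
# request cycle, even during an outage.
HeavyReportJob.set(on_enqueue_failure: :raise).perform_later(account.id)
```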

I made a similar ask over in the Solid Cache repo, as that is actually what took down every request to our app during that time. Given that such a reasonable fallback exists for both our cache and our job runner, it seems silly that we default to throwing exceptions rather than running things inline.

rosa (Member) commented Apr 15, 2025

Hmmm... while for Solid Cache you might want to simply fail open in that case, I'm not sure performing jobs inline is desirable in general. Have you found this behaviour in any other Active Job adapter, either one using the DB or one using another data store that could also be unavailable?

dlinch (Author) commented Apr 16, 2025

@rosa The only thing I could find was Sidekiq Pro having some built-in resilience.

For your second point, that is why I opened the issue for discussion. Releasing a new major version that simply defaulted to inline doesn't make sense to me, but having the option to opt in would be incredibly valuable toward building a robust async job processor. I know in our situation we would rather our jobs ran with at least a partial success rate than fail at a 100% rate with data loss; the failures won't even be logged to the failed executions table, given that the DB is missing in this scenario.

I prefer the choice to fall back to inline; locally, when I drop my Solid Queue DB and attempt to enqueue a job from the console, I get this error:

SolidQueue::Job::EnqueueError (ActiveRecord::NoDatabaseError: We could not find your database: bright-finder/solid. Available database configurations can be found in config/database.yml.)

I believe this job run will be gone forever; it's a simple Exception raise, and it also doesn't include any arguments for me to attempt to rerun manually if I were at a small scale and wanted to. This won't make sense for everyone, and I can't imagine huge apps turning it on, but most of our jobs would still execute in a timely manner, and a per-job or per-enqueue-call option to skip the fallback would provide the flexibility most people need anyway. I just know this feature would have gone a really long way toward helping us limp through the DB outage we had, and I can't see a good reason not to implement it if it were truly opt-in.

rosa (Member) commented Apr 17, 2025

The only thing I could find was Sidekiq Pro having some built-in resilience.

Oh! I didn't know about this Sidekiq Pro feature, but keeping jobs in the process's memory is a great idea; I like that much more than running the jobs inline, and I think it makes much more sense. I'd be happy to review a PR implementing that (I won't have time to work on it myself, at least in the near future), possibly behind an opt-in config flag.
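Very rough sketch of the shape I have in mind (nothing here exists in Solid Queue today; the class and method names are made up):

```ruby
# Sketch only: jobs whose enqueue fails are kept in the process's memory
# and re-enqueued once the database is reachable again.
class PendingEnqueueBuffer
  def initialize
    @pending = Queue.new
  end

  # Would be called from the adapter's rescue path instead of raising.
  def remember(active_job)
    @pending << active_job
  end

  # Would run periodically from a background thread in the same process.
  def flush
    @pending.size.times do
      job = @pending.pop
      begin
        job.enqueue
        @pending << job unless job.successfully_enqueued? # adapter reported failure
      rescue StandardError
        @pending << job # adapter raised; database is presumably still down
      end
    end
  end
end
```

The obvious caveats are bounding the buffer and the fact that anything still buffered is lost if the process itself dies.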

when I drop my Solid Queue DB and attempt to enqueue a job from the console, I get this error:

SolidQueue::Job::EnqueueError (ActiveRecord::NoDatabaseError: We could not find your database: bright-finder/solid. Available database configurations can be found in config/database.yml.)

I believe this job run will be gone forever; it's a simple Exception raise, and it also doesn't include any arguments for me to attempt to rerun manually if I were at a small scale and wanted to

Yes, it would, and the info to re-run manually would be earlier in the logs, I think 🤔 Or you'd have to figure out the arguments from the request. Still, if you drop your Solid Queue DB, I think you'd have much bigger problems than this.
