Missing DB Connection Resiliency #549
Comments
Hmmm... while for Solid Cache, you might want to simply fail open in that case, I'm not sure performing jobs inline is desirable in general. Have you found this behaviour in any other Active Job adapter? Either using the DB or using another data store that could also be unavailable.
@rosa The only thing I could find was Sidekiq's Pro plan having some built-in resilience. For your second point, that is why I opened the issue for discussion. Releasing a new major version that simply defaulted to inline doesn't make sense to me, but having the option to opt in would be incredibly valuable toward building a robust async job processor. In our situation, we would rather have had our jobs perform with at least a partial success rate than fail 100% of the time with data loss; the failures won't be logged to the failed executions table either, given the DB is missing in this scenario. I'd prefer having the choice to fall back to in-line execution. Locally, when I drop my Solid Queue DB and attempt to enqueue a job from the console, I get this error:
I believe this job run will be gone forever; it's a simple Exception raise, and it doesn't include any arguments for me to attempt to rerun it manually if I were at a small scale and wanted to. It won't make sense for everyone, and for huge apps I can't imagine they would turn this on, but most of our jobs would still execute in a timely manner. Having a configuration option on a per-job or per-enqueue-call basis to skip this behavior would provide the flexibility most people would need anyway. I just know this feature would have gone a long way toward helping us limp through the DB outage we had, and I can't see a good reason not to implement it if it were truly opt-in.
Oh! I didn't know about this Sidekiq Pro feature, but keeping jobs in the process's memory is a great idea; I like that much more than running the jobs inline, and I think it makes much more sense. I'd be happy to review a PR implementing that (I won't have time to work on it myself, at least in the near future), possibly with an opt-in config flag.
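A very rough sketch of what that opt-in, in-memory fallback could look like, just to make the idea concrete (every name here is made up; none of this exists in Solid Queue or Active Job today):

```ruby
# Hypothetical sketch: buffer jobs whose enqueue failed and retry them from the
# same process, similar in spirit to Sidekiq Pro's reliable client.
class InMemoryEnqueueBuffer
  def initialize(retry_every: 5)
    @pending = Queue.new          # thread-safe FIFO of serialized job payloads
    @retry_every = retry_every
    start_flusher
  end

  # Called when an enqueue raises: keep the serialized job around for retry.
  def remember(active_job)
    @pending << active_job.serialize
  end

  private

  # Background thread that periodically tries to re-enqueue buffered jobs.
  def start_flusher
    Thread.new do
      loop do
        sleep @retry_every
        flush
      end
    end
  end

  def flush
    until @pending.empty?
      job_data = @pending.pop(true) # non-blocking; only this thread pops
      begin
        ActiveJob::Base.deserialize(job_data).enqueue
      rescue StandardError
        @pending << job_data        # still failing; keep it for the next pass
        break
      end
    end
  end
end
```

Wiring it up would still need a hook (an `around_enqueue` callback or something in the adapter) that rescues the enqueue failure and calls `remember`, plus some handling for process exit while jobs are still buffered.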
Yes, it would, and the info to re-run manually would be earlier in the logs, I think 🤔 Or you'd have to figure out the arguments from the request. Still, if you drop your Solid Queue DB, I think you'd have much bigger problems than this.
Last week, our provisioned Postgres instance was suddenly dropped for our production app. Heroku reported it as a hardware failure on their end, and for ~10 minutes the database backing Solid Cache and Solid Queue was unreachable.
I went looking for what I thought was already an available option: a default fallback to `perform_now` in the scenario that a job fails to enqueue. The only example I could find is that Active Job allows you to pass a block to any `perform_later` and check the enqueue status of that job. That, or figuring out what exceptions would be raised in that scenario and adding a custom rescue block around every place we call `perform_later`. To me, it makes sense to have some kind of top-level configuration to opt in to a fallback strategy of performing jobs in-line. I would rather have slower request times for a short while as our hardware recovers than throw exceptions and require manual intervention to retry jobs from Mission Control, or have to add the same block to every single `perform_later` call throughout our application for such an edge case.
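For reference, that block-based, per-call fallback looks roughly like this (`ProcessOrderJob` and `order` are made-up placeholders; it assumes the failure is surfaced as an `ActiveJob::EnqueueError`, which Active Job records on the job instead of raising, available since Rails 7.0):

```ruby
# Per-call fallback using the block form of perform_later. Only failures that
# Active Job captures as ActiveJob::EnqueueError show up here; anything else
# raised by the adapter would still propagate.
ProcessOrderJob.perform_later(order) do |job|
  unless job.successfully_enqueued?
    Rails.logger.warn("Enqueue failed: #{job.enqueue_error&.message}; running inline")
    job.perform_now
  end
end
```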
I read the section where the `EnqueueError` is raised, but that is a pretty standard `class.name + error.message` string message, and wouldn't give me the necessary information to rerun the job. Plus, it only seems to be raised upon a rescue of `ActiveRecord::ActiveRecordError`, which I'm not sure would be raised from an actual DB connection error. So I don't think there is a reasonable option for a top-level `rescue_from` block to attempt this either.
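In the absence of a built-in option, the per-call-site rescue mentioned earlier would end up looking something like this (helper and job names are made up, and the exception list is a placeholder; it would have to match whatever actually propagates when the database is down):

```ruby
# Hypothetical helper so the rescue isn't repeated at every call site.
# ActiveRecord::ActiveRecordError is a broad stand-in for whatever connection
# error actually escapes the adapter in this scenario.
def enqueue_or_perform_inline(job_class, *args)
  job_class.perform_later(*args)
rescue ActiveRecord::ActiveRecordError => e
  Rails.logger.warn("#{job_class} failed to enqueue (#{e.class}); performing inline")
  job_class.perform_now(*args)
end

enqueue_or_perform_inline(ProcessOrderJob, order)
```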
Is there an appetite to handle this kind of scenario? My thoughts are a new top-level configuration option for the app for opt-in-only behavior, and potentially updating the `set` options to accept a per-call override. I get this is an edge case, but this kind of stuff is bound to happen sooner or later to production applications.
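To make that concrete, here is a sketch of what the (currently non-existent) configuration and per-call override might look like:

```ruby
# Purely hypothetical API; neither the config key nor the set option exists today.

# config/environments/production.rb
config.solid_queue.perform_inline_on_enqueue_failure = true

# Per-call override for jobs that should never run inline:
ProcessOrderJob.set(perform_inline_on_enqueue_failure: false).perform_later(order)
```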
I made a similar ask over in the Solid Cache repo, as that is actually the spot that took down every request to our app during that time. For having such a reasonable fallback scenario for both our cache and job runner, it seems silly that we just default to exceptions rather than running things in-line.