Missing DB Connection Resiliency #549
Comments
Hmmm... while for Solid Cache, you might want to simply fail open in that case, I'm not sure performing jobs inline is desirable in general. Have you found this behaviour in any other Active Job adapter? Either using the DB or using another data store that could also be unavailable.
@rosa The only thing I could find was Sidekiq's Pro plan having some built-in resilience. For your second point, that is why I opened the issue for discussion. Releasing a new major version that simply defaulted to inline doesn't make sense to me, but having the option to opt in would be incredibly valuable toward building a robust async job processor. In our situation, we would rather have had our jobs perform with at least a partial success rate than fail 100% of the time with data loss; the failures won't be logged to the failed executions table either, given the DB is missing in this scenario. I'd prefer having the choice to fall back to in-line execution. Locally, when I drop my Solid Queue DB and attempt to enqueue a job from the console, I get this error:
I believe this job run will be gone forever; it's a simple Exception raise, and it doesn't include any arguments for me to attempt to rerun it manually if I were at a small scale and wanted to. It won't make sense for everyone, and for huge apps I can't imagine they would turn this on, but most of our jobs would still execute in a timely manner. Having a configuration option on a per-job or per-enqueue-call basis to skip this behavior would provide the flexibility most people would need anyway. I just know this feature would have gone a long way toward helping us limp through the DB outage we had, and I can't see a good reason not to implement it if it were truly opt-in.
Oh! I didn't know about this Sidekiq Pro feature, but keeping jobs in the process's memory is a great idea; I like that much more than running the jobs inline, and I think it makes much more sense. I'd be happy to review a PR implementing that (I won't have time to work on it myself, at least in the near future), possibly with an opt-in config flag.
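A very rough sketch of what that opt-in, in-memory fallback could look like, just to make the idea concrete (every name here is made up; none of this exists in Solid Queue or Active Job today):

```ruby
# Hypothetical sketch: buffer jobs whose enqueue failed and retry them from the
# same process, similar in spirit to Sidekiq Pro's reliable client.
class InMemoryEnqueueBuffer
  def initialize(retry_every: 5)
    @pending = Queue.new          # thread-safe FIFO of serialized job payloads
    @retry_every = retry_every
    start_flusher
  end

  # Called when an enqueue raises: keep the serialized job around for retry.
  def remember(active_job)
    @pending << active_job.serialize
  end

  private

  # Background thread that periodically tries to re-enqueue buffered jobs.
  def start_flusher
    Thread.new do
      loop do
        sleep @retry_every
        flush
      end
    end
  end

  def flush
    until @pending.empty?
      job_data = @pending.pop(true) # non-blocking; only this thread pops
      begin
        ActiveJob::Base.deserialize(job_data).enqueue
      rescue StandardError
        @pending << job_data        # still failing; keep it for the next pass
        break
      end
    end
  end
end
```

Wiring it up would still need a hook (an `around_enqueue` callback or something in the adapter) that rescues the enqueue failure and calls `remember`, plus some handling for process exit while jobs are still buffered.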
Yes, it would, and the info to re-run manually would be earlier in the logs, I think 🤔 Or you'd have to figure out the arguments from the request. Still, if you drop your Solid Queue DB, I think you'd have much bigger problems than this.
Last week, our provisioned Postgres instance was suddenly dropped for our production app. Heroku reported it as a hardware failure on their end, and for ~10 minutes the database backing Solid Cache and Solid Queue was unreachable.
I went looking for what I thought was already an available option: a default fallback to `perform_now` in the scenario that a job fails to enqueue. The only example I could find is that Active Job allows you to pass a block to any `perform_later` and check the enqueue status of that job. That, or figuring out what exceptions would be raised in that scenario and adding a custom rescue block around every place we call `perform_later`. To me, it makes sense to have some kind of top-level configuration to opt in to a fallback strategy of performing jobs in-line. I would rather have slower request times for a short while as our hardware recovers than throw exceptions and require manual intervention to retry jobs from Mission Control, or have to add the same block to every single `perform_later` call throughout our application for such an edge case.
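For reference, that block-based, per-call fallback looks roughly like this (`ProcessOrderJob` and `order` are made-up placeholders; it assumes the failure is surfaced as an `ActiveJob::EnqueueError`, which Active Job records on the job instead of raising, available since Rails 7.0):

```ruby
# Per-call fallback using the block form of perform_later. Only failures that
# Active Job captures as ActiveJob::EnqueueError show up here; anything else
# raised by the adapter would still propagate.
ProcessOrderJob.perform_later(order) do |job|
  unless job.successfully_enqueued?
    Rails.logger.warn("Enqueue failed: #{job.enqueue_error&.message}; running inline")
    job.perform_now
  end
end
```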
I read the section where the `EnqueueError` is raised, but that is a pretty standard `class.name + error.message` string message, and wouldn't give me the necessary information to rerun the job. Plus, it only seems to be raised upon a rescue of `ActiveRecord::ActiveRecordError`, which I'm not sure would be raised from an actual DB connection error. So I don't think there is a reasonable option for a top-level `rescue_from` block to attempt this either.
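In the absence of a built-in option, the per-call-site rescue mentioned earlier would end up looking something like this (helper and job names are made up, and the exception list is a placeholder; it would have to match whatever actually propagates when the database is down):

```ruby
# Hypothetical helper so the rescue isn't repeated at every call site.
# ActiveRecord::ActiveRecordError is a broad stand-in for whatever connection
# error actually escapes the adapter in this scenario.
def enqueue_or_perform_inline(job_class, *args)
  job_class.perform_later(*args)
rescue ActiveRecord::ActiveRecordError => e
  Rails.logger.warn("#{job_class} failed to enqueue (#{e.class}); performing inline")
  job_class.perform_now(*args)
end

enqueue_or_perform_inline(ProcessOrderJob, order)
```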
Is there an appetite to handle this kind of scenario? My thoughts are a new top-level configuration option for the app for opt-in-only behavior, and potentially updating the `set` options to accept a per-call override. I get this is an edge case, but this kind of stuff is bound to happen sooner or later to production applications.
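To make that concrete, here is a sketch of what the (currently non-existent) configuration and per-call override might look like:

```ruby
# Purely hypothetical API; neither the config key nor the set option exists today.

# config/environments/production.rb
config.solid_queue.perform_inline_on_enqueue_failure = true

# Per-call override for jobs that should never run inline:
ProcessOrderJob.set(perform_inline_on_enqueue_failure: false).perform_later(order)
```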
I made a similar ask over in the Solid Cache repo, as that is actually the spot that took down every request to our app during that time. For having such a reasonable fallback scenario for both our cache and job runner, it seems silly that we just default to exceptions rather than running things in-line.