Description
Last week, the provisioned Postgres instance for our production app suddenly went down. Heroku reported it as a hardware failure on their end, and for ~10 minutes the database backing Solid Cache and Solid Queue was unreachable.
I went looking for what I thought was already an available option: a default fallback to `perform_now` when a job fails to enqueue. The only thing I found was an example where Active Job lets you pass a block to any `perform_later` call and check the enqueue status of that job. The alternative is figuring out which exceptions would be raised in that scenario and adding a custom rescue block around every place we call `perform_later`. To me, it makes sense to have some kind of top-level configuration to opt in to a fallback strategy of performing jobs in-line. I would rather have slower request times for a short while as our hardware recovers than throw exceptions and require manual intervention to retry jobs from Mission Control, or add the same block to every single `perform_later` call throughout our application for such an edge case.
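To make the per-call workaround concrete, here is a minimal sketch of what wrapping each enqueue might look like. `enqueue_or_perform_now`, `DemoJob`, and `QueueDown` are hypothetical names, not Active Job API. `DemoJob` is a plain-Ruby stand-in that imitates only the parts of Active Job used here (`perform_later` yielding the job to a block, `enqueue_error`, `perform_now`) so the snippet runs without Rails. Rescuing `StandardError` by default is an assumption; a real app would need to narrow it to whatever an unreachable database actually raises.

```ruby
# Hypothetical helper, not part of Active Job: try to enqueue, and run the
# job in-line if the enqueue fails.
def enqueue_or_perform_now(job_class, *args, fallback_on: [StandardError])
  enqueue_failed = false
  begin
    job_class.perform_later(*args) do |job|
      # The job is yielded after the enqueue attempt; enqueue_error is set
      # when the adapter reported the failure as an ActiveJob::EnqueueError.
      enqueue_failed = !job.enqueue_error.nil?
    end
  rescue *fallback_on
    # The backend raised something else, e.g. a connection error.
    enqueue_failed = true
  end
  return :enqueued unless enqueue_failed

  job_class.perform_now(*args)
  :performed_inline
end

# Plain-Ruby stand-in for a job class so the sketch runs without Rails.
class DemoJob
  QueueDown = Class.new(StandardError)

  class << self
    attr_accessor :mode # :ok, :enqueue_error, or :raise

    def performed
      @performed ||= []
    end

    def perform_later(*args)
      raise QueueDown, "database unreachable" if mode == :raise

      job = new(mode == :enqueue_error ? QueueDown.new("enqueue failed") : nil)
      yield job if block_given?
      job.enqueue_error ? false : job
    end

    def perform_now(*args)
      performed << args
    end
  end

  attr_reader :enqueue_error

  def initialize(enqueue_error)
    @enqueue_error = enqueue_error
  end
end
```

With `DemoJob.mode = :raise`, calling `enqueue_or_perform_now(DemoJob, 42)` returns `:performed_inline` and records the arguments in `DemoJob.performed`. The helper still has to be used at every call site, which is the repetition a top-level option would avoid.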
I read the section where the `EnqueueError` is raised, but its message is just a standard `class.name + error.message` string, which wouldn't give me the information needed to rerun the job. Plus, it only seems to be raised when rescuing `ActiveRecord::ActiveRecordError`, which I'm not sure would be raised for an actual DB connection error. So I don't think a top-level `rescue_from` block is a reasonable way to attempt this either.
Is there an appetite to handle this kind of scenario? My thoughts are a new top-level, opt-in configuration option for the app, and potentially updating the `set` options to accept a per-call override. I get that this is an edge case, but this kind of thing is bound to happen sooner or later to production applications.
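To illustrate the shape of that proposal, here is a rough plain-Ruby sketch of an app-wide default with a per-call override. Everything in it is hypothetical: `EnqueueFallback`, `default_strategy`, `attempt`, and the `:raise` / `:inline` values are invented for illustration, and nothing like this exists in Active Job or Solid Queue today. The two lambdas stand in for `perform_later` and `perform_now`, and rescuing `StandardError` is again an assumption.

```ruby
# Hypothetical settings object; no such option exists in Active Job today.
module EnqueueFallback
  STRATEGIES = %i[raise inline].freeze

  class << self
    attr_reader :default_strategy

    # App-wide default, e.g. set once in an initializer.
    def default_strategy=(value)
      validate!(value)
      @default_strategy = value
    end

    # A per-call override wins over the app-wide default.
    def resolve(override = nil)
      strategy = override || default_strategy
      validate!(strategy)
      strategy
    end

    # Calls `enqueue`; if it raises, either re-raises or calls `inline`,
    # depending on the resolved strategy.
    def attempt(enqueue:, inline:, fallback: nil)
      strategy = resolve(fallback)
      begin
        enqueue.call
      rescue StandardError
        raise unless strategy == :inline

        inline.call
      end
    end

    private

    def validate!(value)
      return if STRATEGIES.include?(value)

      raise ArgumentError, "unknown strategy: #{value.inspect}"
    end
  end

  # Opt-in only: the default keeps today's behavior of raising.
  self.default_strategy = :raise
end
```

A call site could then pass `fallback: :inline` for a single enqueue, or the app could set `EnqueueFallback.default_strategy = :inline` once to opt in everywhere.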
I made a similar ask over in the Solid Cache repo, as that is actually what took down every request to our app during that time. With such a reasonable fallback available for both our cache and job runner, it seems silly that we default to exceptions rather than running things in-line.