
fix(workflow): process all activation jobs as a single batch #1488


Merged — 19 commits, Aug 14, 2024

Conversation

@mjameswh (Contributor) commented Aug 7, 2024

What was changed

  • Previously, when processing a Workflow activation, the SDK would execute notifyHasPatch and signalWorkflow jobs in distinct phases, before other types of jobs. The primary reason behind that multi-phase algorithm was to avoid the possibility that a Workflow execution might complete before all incoming signals have been dispatched (at least to the point that the synchronous part of the handler function has been executed).

    This PR replaces that multi-phase algorithm with a simpler one: jobs are sorted as signals and updates -> others, but are no longer processed as distinct batches (i.e. without leaving/reentering the VM context between each group, which automatically triggers the execution of all outstanding microtasks). That single-phase approach resolves some rare, edge-case quirks of the former algorithm (including the one described in [Bug] Workflows can be constructed in which update handlers do not, but should, execute #1474), yet still satisfies the original requirement of ensuring that every signalWorkflow job (and now every doUpdate job as well) has been given a proper chance to execute before the Workflow main function might complete.

    This change is gated by a new SDK flag, to preserve compatibility with workflow execution histories produced by previous releases. However, there is no intention at this point of making that flag rollback-safe.

  • Lang no longer stops processing jobs for the current activation after the workflow code emits a workflow completion command; instead, lang continues to process remaining jobs until the next stop point, after which it reports all generated commands to Core.
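The single-batch ordering described above can be sketched as follows. This is a hedged illustration, not the actual SDK internals: the `Job` type and function names are hypothetical, but the key idea is real — a stable sort puts signals and updates ahead of other jobs while preserving relative order within each group, and the whole batch is then processed without draining microtasks between groups.

```typescript
// Hypothetical, simplified job shape (the real SDK uses protobuf-defined jobs):
type Job =
  | { type: 'signalWorkflow'; name: string }
  | { type: 'doUpdate'; name: string }
  | { type: 'fireTimer'; seq: number }
  | { type: 'notifyHasPatch'; patchId: string };

function prioritize(job: Job): number {
  // Signals and updates come first; everything else follows.
  return job.type === 'signalWorkflow' || job.type === 'doUpdate' ? 0 : 1;
}

function sortJobsForSingleBatch(jobs: Job[]): Job[] {
  // Array.prototype.sort is stable in modern engines, so jobs within each
  // priority class keep their original relative order.
  return [...jobs].sort((a, b) => prioritize(a) - prioritize(b));
}

// Example: a mixed activation is reordered, then processed as ONE batch
// (no VM context exit, hence no forced microtask drain, between groups).
const batch = sortJobsForSingleBatch([
  { type: 'fireTimer', seq: 1 },
  { type: 'signalWorkflow', name: 'unblock' },
  { type: 'doUpdate', name: 'bump' },
]);
console.log(batch.map((j) => j.type).join(','));
// → signalWorkflow,doUpdate,fireTimer
```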

@mjameswh requested a review from a team as a code owner August 7, 2024 21:30
@mjameswh marked this pull request as draft August 7, 2024 22:32
@mjameswh force-pushed the reorder-updates-with-signals branch from c4e0b53 to 2dec464 on August 8, 2024 07:08
@mjameswh changed the title from "fix(workflow): Group update jobs with signals" to "fix(workflow): process all activation jobs as a single batch" Aug 9, 2024
@mjameswh marked this pull request as ready for review August 9, 2024 11:16
@Sushisource (Member) left a comment:
I think this makes good sense. Nothing blocking, but a few suggestions.

@@ -1853,7 +1894,7 @@ test('conditionRacer', async (t) => {
        makeFireTimerJob(1)
      )
    );
-   compareCompletion(t, completion, makeSuccess([{ cancelTimer: { seq: 1 } }]));
+   compareCompletion(t, completion, makeSuccess([], [SdkFlags.ProcessWorkflowActivationJobsAsSingleBatch]));
Member:
Why did this cancel timer go away?

Contributor (Author):
That CancelTimer command was bogus. We just sent a FireTimer(1). Why would we cancel an already fired timer?

That timer comes from the condition(fn, timeout) line. In this case, fn will return true as soon as the signal handler is invoked. Since signals were processed in a distinct phase, and timer events in a second phase, the microtasks created internally by condition() had the opportunity to settle on a state where "fn returns true, because the signal has been received, but the timer has not yet fired, because that job has not yet been processed".

ProcessWorkflowActivationJobsAsSingleBatch fixes that. As soon as the fn returns true, the internal condition promise (the one without a timer) gets resolved, but by the time the then microtask gets executed, the TimerFired event has also been processed, and so the attempt to cancel the timer is known to be a noop.

Just for extra safety, I modified that test to return the boolean from condition(fn, timeout), so we know that we didn't break the documented semantics of that API.
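The race described above can be modeled with plain promises. This is a hedged toy sketch, not SDK code: `toyCondition` is a hypothetical stand-in for condition(fn, timeout), modeled as a race between "fn became true" and "timer fired". It shows that when both the signal and the TimerFired job are applied in the same batch, before any microtask runs, the `then` callback observes both as already settled, so no CancelTimer command is warranted.

```typescript
// Toy model: condition(fn, timeout) ≈ a race between the predicate becoming
// true and the timeout timer firing. Returns true if the predicate won.
async function toyCondition(
  fnTrue: Promise<void>,
  timerFired: Promise<void>
): Promise<boolean> {
  return Promise.race([
    fnTrue.then(() => true),
    timerFired.then(() => false),
  ]);
}

let resolveFn!: () => void;
let resolveTimer!: () => void;
const fnTrue = new Promise<void>((r) => (resolveFn = r));
const timerFired = new Promise<void>((r) => (resolveTimer = r));

const result = toyCondition(fnTrue, timerFired);

// Single-batch processing: both jobs are applied synchronously, before any
// microtask gets a chance to run.
resolveFn();    // the signal handler makes fn return true
resolveTimer(); // TimerFired, from the same activation batch

result.then((won) => console.log('condition returned', won));
// → condition returned true
// fn won the race, and by the time its `then` microtask ran, the timer was
// already known to have fired, so cancelling it would be a no-op.
```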

@mjameswh force-pushed the reorder-updates-with-signals branch from 0d5d76e to 9c2eb5b on August 14, 2024 10:42
@mjameswh merged commit d6e2738 into temporalio:main Aug 14, 2024
70 checks passed
@mjameswh deleted the reorder-updates-with-signals branch November 22, 2024 16:19