[REF] Cache nodes in workflow to speed up construction, other optimizations #3331
Conversation
- Networkx represents graphs as a dictionary where the nodes are the keys. As such, we can use the built-in `__contains__` function for fast lookup. We can also iterate the graph directly instead of constructing a `NodeView` with `self._graph.nodes()`.
- Also avoids casting the list to `np.array`.
- Update the cache every time we connect/disconnect or add/remove a node, and keep track of which nodes are workflows and which are not. As a result, `_has_node` no longer needs to iterate over all nodes and can use O(1) set operations.
- No need to loop over entries; `__contains__` is O(1).
- Faster operations for update and contains.
- The function only adds/removes nodes and adjusts their connections, so there are no serial dependencies and we can iterate in any order.
- Only the prefix of the regular expression changes. We can detect the prefix with `startswith` and reuse the same regular expression across the loop, so Python can cache the compiled regex internally and we save some time.
Codecov Report
```
@@            Coverage Diff             @@
##           master    #3331      +/-   ##
==========================================
+ Coverage   65.05%   68.31%   +3.26%
==========================================
  Files         302      302
  Lines       39953    48781    +8828
  Branches     5287     7224    +1937
==========================================
+ Hits        25992    33327    +7335
- Misses      12896    14234    +1338
- Partials     1065     1220     +155
```
Awesome. This all looks correct. A few thoughts, but nothing blocking. Looking at the style check, it looks like `black` released a new version that affects a lot of files, so it should not block this PR, either.
Co-authored-by: Chris Markiewicz <[email protected]>
Two last thoughts, but otherwise this LGTM.
- Apply suggestions from code review Co-authored-by: Chris Markiewicz <[email protected]>
Thank you for your quick feedback! I always feel like I'm learning a lot from your code reviews.
Thank you!
Summary
I'm working with fMRIPrep and large data sets (via the ENIGMA consortium). In my testing, Nipype workflows with more than a few hundred thousand nodes can be prohibitively slow, so that it can take days before processing starts. This comes specifically from the functions `_generate_flatgraph` and `generate_expanded_graph`. While my previous pull requests #3184 and #3260 fixed some bottlenecks, I recently had the time to do some more profiling and to fix a few more.

I understand that Pydra is just around the corner, and that switching to it may solve my performance troubles then. However, the small changes outlined here can reduce the time before processing starts by approximately half (in my testing with a workflow of 75k nodes).
I have tried to divide up the changes into independent commits, so that you can decide which ones may be interesting for inclusion.
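For context, construction profiles like the ones below can be gathered with the standard-library profiler. This is only a minimal sketch; `make_workflow` is a hypothetical stand-in for whatever builds the real Nipype workflow:

```python
import cProfile
import pstats

def make_workflow():
    # Hypothetical stand-in for the real workflow construction
    total = 0
    for i in range(1000):
        total += i
    return total

profiler = cProfile.Profile()
profiler.enable()
make_workflow()
profiler.disable()

# Show the most expensive calls by cumulative time
stats = pstats.Stats(profiler).sort_stats("cumulative")
stats.print_stats(10)
```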
Here is my profile before starting out. Please note that the absolute durations are not interpretable, because the profiler itself slows execution down considerably.

Here is the profile with all proposed changes. Note that `_create_flat_graph` has become the dark gray box without a caption on the bottom right.

![image](blob:https://github.com/8a26ed88-525d-4d0b-9a21-504219398b95)

List of changes proposed in this PR
The `workflow._has_node` function is called for every new connection, and currently iterates through all nodes in the workflow and its nested workflows, comparing each node to the `wanted_node`. We can speed this up because `networkx` graphs are dictionaries and support the `in` statement. This reduces the scaling behavior from linear to almost constant time, but only in the best-case scenario that the `wanted_node` is in the workflow. Otherwise, we still need to iterate to find all nested workflows. To overcome this last limitation, we can cache the nested workflows in a set and update it whenever the graph changes.
nipype/nipype/pipeline/engine/workflows.py
Lines 906 to 913 in 4e8f54d
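The caching idea can be sketched as follows. This is an illustrative mock-up, not Nipype's actual API: the class, `_nested_workflows_cache`, and the `add_nodes`/`remove_nodes` methods are simplified for demonstration.

```python
import networkx as nx

class Workflow:
    def __init__(self, name):
        self.name = name
        self._graph = nx.DiGraph()
        # Cache of nested workflows, kept in sync on every add/remove
        self._nested_workflows_cache = set()

    def add_nodes(self, nodes):
        for node in nodes:
            self._graph.add_node(node)
            if isinstance(node, Workflow):
                self._nested_workflows_cache.add(node)

    def remove_nodes(self, nodes):
        for node in nodes:
            self._graph.remove_node(node)
            self._nested_workflows_cache.discard(node)

    def _has_node(self, wanted_node):
        # Fast path: dict-backed ``in`` on the graph is O(1) on average
        if wanted_node in self._graph:
            return True
        # Slow path only recurses into cached nested workflows,
        # instead of iterating over every node
        return any(wf._has_node(wanted_node)
                   for wf in self._nested_workflows_cache)
```

With this structure, a hit on the top-level graph never touches the nested workflows at all, and a miss only visits the (usually few) nested workflows rather than every node.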
The `_merge_graphs` function uses `np.unique` on a list of strings. We can use the built-in `set` instead to avoid casting to a NumPy array. `np.unique` additionally sorts the input array, which we really don't need here.

nipype/nipype/pipeline/engine/utils.py
Line 756 in 4e8f54d
Use sets wherever order is not important and we just need to check whether a string or a node is contained or not, again reducing linear scaling to a constant time look-up.
nipype/nipype/pipeline/engine/workflows.py
Lines 173 to 174 in 4e8f54d
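The asymptotic difference is easy to see with a toy benchmark (sizes and names here are arbitrary): a list membership test scans every element, while a set hashes directly to the answer.

```python
import timeit

items = [f"node_{i}" for i in range(10_000)]
as_list = items
as_set = set(items)

# Worst case for the list: the element is missing, so all
# 10,000 entries are compared on every lookup
t_list = timeit.timeit(lambda: "missing" in as_list, number=1000)
t_set = timeit.timeit(lambda: "missing" in as_set, number=1000)
```

On any reasonable machine `t_set` is orders of magnitude smaller than `t_list`, and the gap grows linearly with the number of nodes.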
Remove a topological sort that may not be necessary. Please correct me if I'm wrong :-)
nipype/nipype/pipeline/engine/workflows.py
Line 942 in 4e8f54d
Avoid re-compiling regular expressions in a loop. Python caches patterns passed to the module-level `re` functions, so keeping the pattern constant across iterations lets that cache do its work.
nipype/nipype/pipeline/engine/utils.py
Lines 1114 to 1116 in 4e8f54d
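A sketch of the transformation, with made-up data (`lines`, `prefix`, and the pattern are illustrative, not the actual Nipype code): instead of building a new pattern string per iteration, strip the variable prefix with `str.startswith` and reuse a single compiled pattern.

```python
import re

lines = ["prefix_sub-01_task", "prefix_sub-02_task", "other_sub-03_task"]
prefix = "prefix"

# Compiled once; the prefix check happens outside the regex,
# so the pattern never changes across the loop
pattern = re.compile(r"_sub-(\d+)")

subjects = []
for line in lines:
    if not line.startswith(prefix):
        continue
    match = pattern.match(line[len(prefix):])
    if match:
        subjects.append(match.group(1))
# subjects is now ["01", "02"]
```

Even without an explicit `re.compile`, passing the same pattern string to `re.match` inside the loop would hit Python's internal regex cache; the point is to keep the pattern constant so the compile cost is paid only once.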
Acknowledgment