Enable sampling multiple nodes per "example" #155
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
We have a use case where we'd like to sample against multiple nodes at the same time. For instance, we'd like to generate subgraphs for both
A
andB
in the below graph.I'd like to be able to specify that input_nodes =
[[A, B], [C, D]]
as inputs toDistNeighborLoader
1 so that we sample againstA, B
for one example andB, C
for another example.While it is possible to achieve the same functionality via batching, I feel like it will be very tedious to keep track of both the "example count" and "batch size" and how they interplay.
Additionally, if we must batch here, then we would need to construct our
input_nodes
very carefully, which I feel is also quite error prone.Without this change, when we are iterating through the
NodeSamplerInput
in_sampling_worker_loop
2, if we have something likeNodeSamplerInput([[0, 1], [2, 3]])
then when the dataloader outputstensor([0])
we would pass intensor([[0, 1]])
, whose shape is (1, 2) to the inducer 3, and the inducer would useseed_size = tensor([[0, 1]]).size(0) = 1
I feel like the existing behavior is somewhat surprising, as it essentially means if you pass in a tensor of greater than dim 1, some of your inputs get ignored.
Alternatives:
I considered making it so we could pass in
NodeSamplerInput
(or a subclass) toDisNeighborLoader
but that seemed quite complicated given that the sampler input has some interplay with the worker options/etc and I wasn't sure how to consolidate all of the types ofinput_data
(input_data: list[RemoteNodeSplitSamplerInput] | RemoteNodeSplitSamplerInput | RemoteNodePathSamplerInput | NodeSamplerInput
) 4