This repository was archived by the owner on Nov 17, 2023. It is now read-only.

Commit 4149f8b

ptrendx authored and DickJC123 committed
Pointwise fusion for GPU (#15167)
* Beginning of RTC of pointwise ops
* Code generation from the given JSON
* Add initial simple_partition_pass and use it for pointwise fusion
* Fix the fusion, use a symbol.Copy() at the beginning of the binding function, use the name of input nodes in the CUDA code
* Fixes
* Adding support for attribute inference for backward nodes when fusing
* Keep proper input ordering for fused Op
* Instantiate the indexed_graph before starting the subgraph replacement, return a new graph to reset the indexed_graph
* Fuse backward
* Fix ordering of subgraph node inputs using subgraph topological ordering instead of main graph topological ordering, add tvm.patch
* Exclude forward node fusion during the fusion of the nodes in the backward graph
* Dealing with fused backward nodes inferattr
* Use subgraph.indexed_graph() instead of main for _FusedOpHelper nodes node_id, invert control_deps loop to modify topology of subgraph before calling its indexed_graph(), check that all nodes of the first DFSVisit are actually in the subgraph
* Adding support for other reqs in codegen
* Fix
* Cleaning
* Change the TVM submodule
* More cleaning
* Making linter happy
* Do fusion only if default context is GPU
* Fixes for tests: add powerscalar and rpowerscalar, fix return type of zero and one, cleaning, fixing lint, go back to proper TVM submodule
* Fix the TVM commit
* Fix lint
* Guard fusion with MXNET_USE_CUDA
* Fix
* Fix clang-tidy
* Add erf and erfinv backward
* Gluon support for fusion
* Cleaning
* Cleaning and allow shape/type change in FusedOp
* Fixing Gluon bugs
* Fixing after rebase
* Fixing race condition and guarding against races when using NVRTC
* Cleaning and renaming FusedOp to _FusedOp
* Going easy on Windows compiler
* Disable fusion on Windows for now
* Refactor InferAttr and InferShapeAttr
* Added slice and half2 support to FusedOp
* Fix lint errors
* Added multiple types support for vector loading/storing
* Add slice fusion when it's at the beginning of subgraphs
* Removed constant ndim assumption in fused op
* Fix memory alignment issue in slice for FusedOp
* Fixes
* Fix lint errors
* Do not include cuda_fp16.h
* Refactor fused op op lists
* Make linter happy
* Changes from review
* Fixes after rebase
* Expand FusedOp support for slice
* Fix for fp16 _zeros and _ones
* Fix
* Moving aux functions to unnamed namespace and detail namespace -> fusion namespace
* Disabling fusion if it alters topological order of inputs
* Print code only when env variable is set
* Fix
* Fix lint and 2 tests that specify the same names for multiple inputs
* Fixes from review and disabling fusion of slice with non-default step
* Add amp_cast to fusion, fixes
* Add amp_multicast and its backward to the list of supported ops
* Apply wording suggestions from code review (Co-Authored-By: Aaron Markham <[email protected]>)
* Apply wording suggestions from code review (Co-Authored-By: Aaron Markham <[email protected]>)
* Make clearer comment
* Adding punctuation and capitalization to \brief descriptions
* Fix
* Fix
* Add backward_cast to fusion
* Adding unit tests for fusion; fix for erfinv_grad
* Adding slice ops and add_n to tests
* Fixes from review
* Setting inplace option
* Fix lint
* Storing double in half
* Retrigger CI
* Slight relaxing of the relative tolerance in the test
* Move the env variable check to the end
* Fix a race condition between InferShape and scheduled Forward
* Fix flaky test_fusion test involving fp32 erfinv op
* Fix from review
* Added broadcast_like and slice_like to fused op
* Minor fix and cleanup
* Added negative axis support in slice_axis, temporarily disabled fusion of slice_like and broadcast_like
* Added axes support to slice_like
* Added axis support to broadcast_like
* Add fast_load_slice function to fused op code
* Added runtime switch for choosing fast and slow slice kernel
* Fix lint and warning
* Going easy on Windows compiler (again)
* Fix slice_like
* Debug broadcast_like fusion
* Fix lint
* Fix lint
* Trigger CI
* Get rid of the initializer list
* Fix backward calls with different gradient type
* Avoid cycle when adding node specific for inputs of subgraph for pointwise fusion
* Fix lint
* Add namespace to the fusion implementations
* Set launch bounds on the fused kernel
* Fix NumPy tests
* Test showcasing an issue fixed in PR #16553
* Cast scalars to FP32 and perform (a*1.0/b) instead of (a/b); fix lint errors; fix lint
* Fix a bug in cycle detection for inputs-only op in pointwise fusion
* Add comments to simple_partition_pass.h file
1 parent ef19b09 commit 4149f8b

File tree

20 files changed: +3862, -216 lines

docs/static_site/src/pages/api/faq/env_var.md

Lines changed: 18 additions & 7 deletions
@@ -200,12 +200,12 @@ The following environments can be used to profile the application without changi
 
 * MXNET_PROFILER_AUTOSTART
   - Values: 0(false) or 1(true) ```(default=0)```
-  - Set to 1, MXNet starts the profiler automatically. The profiling result is stored into profile.json in the working directory.
+  - Set to 1, MXNet starts the profiler automatically. The profiling result is stored into profile.json in the working directory.
 
 * MXNET_PROFILER_MODE
   - Values: 0(false) or 1(true) ```(default=0)```
-  - If set to '0', profiler records the events of the symbolic operators.
-  - If set to '1', profiler records the events of all operators.
+  - If set to '0', profiler records the events of the symbolic operators.
+  - If set to '1', profiler records the events of all operators.
 
 ## Interface between Python and the C API
 
@@ -241,14 +241,14 @@ If ctypes is used, it must be `mxnet._ctypes.ndarray.NDArrayBase`.
 
 * MXNET_CUDA_ALLOW_TENSOR_CORE
   - 0(false) or 1(true) ```(default=1)```
-  - If set to '0', disallows Tensor Core use in CUDA ops.
-  - If set to '1', allows Tensor Core use in CUDA ops.
+  - If set to '0', disallows Tensor Core use in CUDA ops.
+  - If set to '1', allows Tensor Core use in CUDA ops.
   - This variable can only be set once in a session.
 
 * MXNET_CUDA_TENSOR_OP_MATH_ALLOW_CONVERSION
   - 0(false) or 1(true) ```(default=0)```
-  - If set to '0', disallows implicit type conversions to Float16 to use Tensor Cores
-  - If set to '1', allows CUDA ops like RNN and Convolution to use TensorCores even with Float32 input data by using implicit type casting to Float16. Only has an effect if `MXNET_CUDA_ALLOW_TENSOR_CORE` is `1`.
+  - If set to '0', disallows implicit type conversions to Float16 to use Tensor Cores
+  - If set to '1', allows CUDA ops like RNN and Convolution to use TensorCores even with Float32 input data by using implicit type casting to Float16. Only has an effect if `MXNET_CUDA_ALLOW_TENSOR_CORE` is `1`.
 
 * MXNET_CUDA_LIB_CHECKING
   - 0(false) or 1(true) ```(default=1)```
@@ -328,6 +328,17 @@ If ctypes is used, it must be `mxnet._ctypes.ndarray.NDArrayBase`.
     with float32.
   - Model accuracies do not necessarily improve with this environment variable turned on.
 
+* MXNET_USE_FUSION
+  - Values: 0(false) or 1(true) ```(default=1)```
+  - If this variable is set, MXNet will try fusing some of the operations (pointwise operations only for now).
+  - It works in Symbolic execution as well as in Gluon models hybridized with ```static_alloc=True``` option.
+  - Only applies to MXNet that has been compiled with CUDA (```pip install mxnet-cuXX``` or built from source with ```USE_CUDA=1```) and running on GPU.
+
+* MXNET_FUSION_VERBOSE
+  - Values: 0(false) or 1(true) ```(default=0)```
+  - Only applies to MXNet that has been compiled with CUDA and when ```MXNET_USE_FUSION``` option is enabled.
+  - If this variable is set, MXNet will print the code for fused operators that it generated.
+
 Settings for Minimum Memory Usage
 ---------------------------------
 - Make sure ```min(MXNET_EXEC_NUM_TEMP, MXNET_GPU_WORKER_NTHREADS) = 1```
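The two new variables above are plain environment switches, so a short usage sketch may help. The following is a minimal example, not part of this commit, assuming a CUDA build of MXNet and a visible GPU; the tiny HybridBlock is an arbitrary chain of pointwise ops chosen only to give the pass something to fuse.

```python
# Minimal sketch of enabling the new fusion switches from Python (illustrative only).
import os
os.environ['MXNET_USE_FUSION'] = '1'      # pointwise fusion (1 is the default)
os.environ['MXNET_FUSION_VERBOSE'] = '1'  # print the generated fused-kernel code

import mxnet as mx
from mxnet.gluon import nn

class PointwiseNet(nn.HybridBlock):
    def hybrid_forward(self, F, x):
        # a chain of elementwise ops that is a candidate for fusion into one kernel
        return F.sqrt(F.exp(x) + 1.0) * x

net = PointwiseNet()
net.initialize(ctx=mx.gpu(0))
net.hybridize(static_alloc=True)   # fusion needs the static symbolic graph

x = mx.nd.random.uniform(shape=(4, 8), ctx=mx.gpu(0))
print(net(x))
```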

src/common/exec_utils.cc

Lines changed: 79 additions & 0 deletions
@@ -0,0 +1,79 @@
/*
 * Licensed to the Apache Software Foundation (ASF) under one
 * or more contributor license agreements.  See the NOTICE file
 * distributed with this work for additional information
 * regarding copyright ownership.  The ASF licenses this file
 * to you under the Apache License, Version 2.0 (the
 * "License"); you may not use this file except in compliance
 * with the License.  You may obtain a copy of the License at
 *
 *   http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing,
 * software distributed under the License is distributed on an
 * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
 * KIND, either express or implied.  See the License for the
 * specific language governing permissions and limitations
 * under the License.
 */

/*!
 * \file exec_utils.cc
 * \brief Implementation of executor util functions.
 */

#include "exec_utils.h"
#include <unordered_set>
#include <unordered_map>
#include <string>

namespace mxnet {
namespace common {

void CopyGraph(nnvm::Graph *dst, const nnvm::Graph &src, bool copy_variables) {
  using nnvm::Node;
  using nnvm::NodePtr;
  using nnvm::NodeEntry;
  std::unordered_map<Node*, NodePtr> old_new;
  // use DFSVisit to copy all the nodes
  DFSVisit(src.outputs, [&old_new, copy_variables](const NodePtr& node) {
    NodePtr np;
    if (copy_variables || !node->is_variable()) {
      np = Node::Create();
      np->attrs = node->attrs;
    } else {
      np = node;
    }
    old_new[node.get()] = std::move(np);
  });
  // connect nodes of new graph
  for (const auto &kv : old_new) {
    for (const NodeEntry& e : kv.first->inputs) {
      Node *ptr = e.node.get();
      kv.second->inputs.emplace_back(NodeEntry{old_new[ptr], e.index, e.version});
    }
    for (const NodePtr& p : kv.first->control_deps) {
      kv.second->control_deps.emplace_back(old_new[p.get()]);
    }
  }
  // set the head
  for (const NodeEntry &e : src.outputs) {
    (*dst).outputs.emplace_back(NodeEntry{old_new[e.node.get()], e.index, e.version});
  }
}

bool CheckForInputNameDuplicates(const nnvm::IndexedGraph &idx) {
  std::unordered_set<std::string> names;
  for (const auto& nid : idx.input_nodes()) {
    const std::string &name = idx[nid].source->attrs.name;
    if (names.count(name)) {
      LOG(WARNING) << "Variable name " << name << " is used more than once!";
      return false;
    }
    names.insert(name);
  }
  return true;
}

}  // namespace common
}  // namespace mxnet
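For context, here is a hypothetical Python snippet (not from this commit) showing the situation CheckForInputNameDuplicates guards against: two distinct Variable nodes that share one name. Such a graph is left unfused and only produces the warning above.

```python
# Illustrative only: build a symbol whose inputs reuse a name.
import mxnet as mx

a = mx.sym.Variable('data')
b = mx.sym.Variable('data')   # a second, distinct input node reusing the same name
c = a + b                     # the graph now has two inputs called 'data'
print(c.list_arguments())     # 'data' appears twice among the inputs
```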

src/common/exec_utils.h

Lines changed: 19 additions & 0 deletions
@@ -621,6 +621,25 @@ inline nnvm::Graph AssignContext(nnvm::Graph g,
   return g;
 }
 
+/*!
+ * \brief Copy the graph, optionally leaving original Variable nodes.
+ *
+ * \param dst destination graph
+ * \param src source graph being copied
+ * \param copy_variables whether to copy or reuse Variable nodes from the
+ *                       source graph
+ */
+void CopyGraph(nnvm::Graph *dst, const nnvm::Graph &src, bool copy_variables);
+
+/*!
+ * \brief Check whether graph contains any duplicated names in its inputs.
+ *
+ * \param idx Indexed graph being checked
+ *
+ * \return true if there are no duplicates, false otherwise
+ */
+bool CheckForInputNameDuplicates(const nnvm::IndexedGraph &idx);
+
 }  // namespace common
 }  // namespace mxnet
 #endif  // MXNET_COMMON_EXEC_UTILS_H_

src/executor/exec_pass.h

Lines changed: 42 additions & 0 deletions
@@ -34,10 +34,34 @@
 #include <vector>
 #include <memory>
 #include <string>
+#include <utility>
+#include <tuple>
 
 namespace mxnet {
 namespace exec {
 
+template <typename Attr>
+using FAccessSubgraphAttr = std::function<std::tuple<const nnvm::NodePtr,
+                                                     std::vector<Attr>,
+                                                     std::vector<Attr>>
+                                          (const NodeAttrs& attrs)>;
+
+using FAccessSubgraphShape = FAccessSubgraphAttr<mxnet::TShape>;
+using FAccessSubgraphType = FAccessSubgraphAttr<int>;
+using FAccessSubgraphStorageType = FAccessSubgraphAttr<int>;
+
+template <typename Attr>
+using FProvideSubgraphAttr = std::function<void (const NodeAttrs& attrs,
+                                                 const std::vector<nnvm::NodePtr> &nodes,
+                                                 const std::vector<std::vector<Attr>> &in_attrs,
+                                                 const std::vector<std::vector<Attr>> &out_attrs)>;
+using FProvideSubgraphShape = FProvideSubgraphAttr<mxnet::TShape>;
+using FProvideSubgraphType = FProvideSubgraphAttr<int>;
+using FProvideSubgraphStorageType = FProvideSubgraphAttr<int>;
+
+using TIsFusion = bool;
+using TIsFusionHelper = bool;
+
 /*! \brief reuse graph definition */
 using nnvm::Graph;
 
@@ -170,6 +194,24 @@ void AttachOpResources(const Graph& g,
  */
 Graph DetectInplaceAddTo(Graph g);
 
+/*!
+ * \brief Fuse pointwise operations in the forward pass.
+ *
+ * \param g input graph (needs to be entire graph, not just forward part)
+ *
+ * \return graph with fused pointwise operations in the forward pass
+ */
+Graph FusePointwiseForward(Graph&& g);
+
+/*!
+ * \brief Fuse pointwise operations in the backward pass.
+ *
+ * \param g input graph (needs to be entire graph, not just forward part)
+ *
+ * \return graph with fused pointwise operations in the backward pass
+ */
+Graph FusePointwiseBackward(Graph&& g);
+
 /*!
  * \brief Infer shapes in the graph given the information.
  * \param graph The input graph.

src/executor/graph_executor.cc

Lines changed: 44 additions & 4 deletions
@@ -26,6 +26,7 @@
 #include <nnvm/graph.h>
 #include <nnvm/pass_functions.h>
 #include <vector>
+#include <set>
 #include <algorithm>
 
 #include "./exec_pass.h"
@@ -337,6 +338,7 @@ nnvm::Graph GraphExecutor::InitFullGraph(nnvm::Symbol symbol,
   if (!need_grad_) return g;
   for (size_t i = 0; i < g.outputs.size(); ++i) {
     NodeEntry ngrad(nnvm::Node::Create(), 0, 0);
+    ngrad.node->attrs.name = "_head_grad_" + std::to_string(i);
     head_grad_entry_.emplace_back(AttrHint(ngrad, g.outputs[i]));
     head_grad_map_[ngrad.node.get()] = i;
   }
@@ -377,6 +379,7 @@ nnvm::Graph GraphExecutor::InitFullGraph(nnvm::Symbol symbol,
   for (const auto &e : g_grad.outputs) {
     g.outputs.push_back(e);
   }
+
   return g;
 }
 
@@ -796,6 +799,7 @@ void GraphExecutor::Init(nnvm::Symbol symbol,
                          const nnvm::NodeEntryMap<NDArray>& feed_dict) {
   nnvm::Graph g = InitGraph(symbol, default_ctx, ctx_map, in_arg_ctxes, arg_grad_ctxes,
                             aux_state_ctxes, grad_req_types);
+
   // The following code of shape and dtype inferences and argument
   // initialization is for simple_bind only. Regular bind operation
   // should do this differently.
@@ -976,6 +980,7 @@ Executor* GraphExecutor::Reshape(const bool partial_shaping,
              this);
   return exec;
 }
+
 /*!
  * \brief This function is triggered by both simple_bind
  * and bind flows.
@@ -993,6 +998,41 @@ Graph GraphExecutor::InitGraph(nnvm::Symbol symbol,
   // setup gradient
   nnvm::Graph g = InitFullGraph(symbol, grad_req_types);
 
+#if MXNET_USE_CUDA && !defined(_WIN32)
+  if (default_ctx.dev_mask() == Context::kGPU && dmlc::GetEnv("MXNET_USE_FUSION", true)) {
+    nnvm::Graph unoptimized_graph;
+    common::CopyGraph(&unoptimized_graph, g, false);
+
+    if (common::CheckForInputNameDuplicates(unoptimized_graph.indexed_graph())) {
+      g.attrs["num_forward_outputs"] = std::make_shared<nnvm::any>(num_forward_outputs_);
+      g = FusePointwiseForward(std::move(g));
+      g.attrs["num_forward_outputs"] = std::make_shared<nnvm::any>(num_forward_outputs_);
+      g = FusePointwiseBackward(std::move(g));
+      // Check the topological order of inputs
+      const auto &original_inputs = unoptimized_graph.indexed_graph().input_nodes();
+      const auto &new_inputs = g.indexed_graph().input_nodes();
+      if (original_inputs.size() != new_inputs.size()) {
+        LOG(WARNING)
+          << "Number of inputs after fusion does not match original number of inputs. "
+          << "This is most probably a bug. Disabling fusion for this run.";
+        g = unoptimized_graph;
+      } else {
+        for (size_t i = 0; i < new_inputs.size(); ++i) {
+          if (unoptimized_graph.indexed_graph()[original_inputs[i]].source->attrs.name !=
+              g.indexed_graph()[new_inputs[i]].source->attrs.name) {
+            LOG(WARNING) << "Disabling fusion due to altered topological order of inputs.";
+            g = unoptimized_graph;
+            break;
+          }
+        }
+      }
+    } else {
+      LOG(WARNING)
+        << "Graph contains duplicate names for some of its inputs - fusion is NOT enabled!";
+    }
+  }
+#endif  // MXNET_USE_CUDA
+
   // create "device" and "context" attrs for the graph
   g = AssignContext(g, default_ctx, ctx_map,
                     in_arg_ctxes,
@@ -1946,7 +1986,7 @@ Executor *Executor::SimpleBind(nnvm::Symbol symbol,
       symbol = exec::BuildSubgraph(symbol, backend, arg_shape_map, arg_dtype_map, arg_stype_map,
                                    default_ctx, group2ctx, &tmp_in_arg_ctxes, &tmp_arg_grad_ctxes,
                                    &tmp_grad_req_types, &tmp_aux_state_ctxes, verbose);
-      exec->Init(symbol, default_ctx, group2ctx, tmp_in_arg_ctxes, tmp_arg_grad_ctxes,
+      exec->Init(symbol.Copy(), default_ctx, group2ctx, tmp_in_arg_ctxes, tmp_arg_grad_ctxes,
                  tmp_aux_state_ctxes, arg_shape_map, arg_dtype_map, arg_stype_map,
                  tmp_grad_req_types, shared_arg_names, &tmp_in_args, &tmp_arg_grads,
                  &tmp_aux_states, shared_buffer, shared_exec);
@@ -1985,7 +2025,7 @@ Executor *Executor::SimpleBind(nnvm::Symbol symbol,
   }
   if (!init) {
     // init without subgraph
-    exec->Init(symbol, default_ctx, group2ctx, in_arg_ctxes, arg_grad_ctxes, aux_state_ctxes,
+    exec->Init(symbol.Copy(), default_ctx, group2ctx, in_arg_ctxes, arg_grad_ctxes, aux_state_ctxes,
                arg_shape_map, arg_dtype_map, arg_stype_map, grad_req_types, shared_arg_names,
                in_args, arg_grads, aux_states, shared_buffer, shared_exec);
   }
@@ -2017,8 +2057,8 @@ Executor *Executor::Bind(nnvm::Symbol symbol,
                                      verbose);
     }
   }
-  exec->Init(symbol, default_ctx, group2ctx, tmp_in_args, tmp_arg_grad_store, tmp_grad_req_type,
-             tmp_aux_states, reinterpret_cast<Executor*>(shared_exec));
+  exec->Init(symbol.Copy(), default_ctx, group2ctx, tmp_in_args, tmp_arg_grad_store,
+             tmp_grad_req_type, tmp_aux_states, reinterpret_cast<Executor*>(shared_exec));
   return exec;
 }
 }  // namespace mxnet
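A small end-to-end sketch of the symbolic path served by InitGraph above; it is illustrative only and assumes a CUDA build of MXNet with a GPU available. With MXNET_USE_FUSION=1 (the default) the full forward plus backward graph is passed through FusePointwiseForward/FusePointwiseBackward before contexts are assigned.

```python
# Illustrative only: simple_bind on GPU goes through GraphExecutor::Init -> InitGraph shown above.
import mxnet as mx

data = mx.sym.Variable('data')
net = mx.sym.exp(data) * mx.sym.sqrt(data) + 1.0   # pointwise-only subgraph

exe = net.simple_bind(ctx=mx.gpu(0), data=(32, 16))
out = exe.forward(is_train=True, data=mx.nd.ones((32, 16), ctx=mx.gpu(0)))
exe.backward(mx.nd.ones_like(out[0]))
print(out[0].shape)
```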
