NineToothed is a Triton-based domain-specific language (DSL). By introducing tensor-oriented meta-programming (TOM), it makes writing high-performance GPU kernels easier.
We can use `pip` to install `ninetoothed`:

```shell
pip install ninetoothed
```

After successfully running the above command, `ninetoothed` will be installed. However, to fully utilize its capabilities, you also need to install a deep learning framework supported by `ninetoothed`. For trial purposes, we recommend installing `torch`.
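To confirm that the package is visible to your Python environment, one quick check, using only the standard library, is to query the installed version (a minimal sketch; the exact version string depends on your installation):

```python
import importlib.metadata

# Prints the installed version of ninetoothed, e.g. "0.x.y".
print(importlib.metadata.version("ninetoothed"))
```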
Thanks to tensor-oriented meta-programming, NineToothed can be written using the arrange-and-apply paradigm, which involves separately defining the `arrangement`, the `application`, and the `tensors`, and then integrating them using `ninetoothed.make` to generate the kernel.
Here is the code we need for matrix multiplication:
```python
import ninetoothed
import ninetoothed.language as ntl
from ninetoothed import Tensor, block_size

# Block sizes are symbolic; concrete values are determined at compile time.
BLOCK_SIZE_M = block_size()
BLOCK_SIZE_N = block_size()
BLOCK_SIZE_K = block_size()


def arrangement(input, other, output):
    # Each program instance computes one (BLOCK_SIZE_M, BLOCK_SIZE_N) tile of the output.
    output_arranged = output.tile((BLOCK_SIZE_M, BLOCK_SIZE_N))

    # Tile `input` along M and K, then expand the row of K-tiles across the N dimension.
    input_arranged = input.tile((BLOCK_SIZE_M, BLOCK_SIZE_K))
    input_arranged = input_arranged.tile((1, -1))
    input_arranged = input_arranged.expand((-1, output_arranged.shape[1]))
    input_arranged.dtype = input_arranged.dtype.squeeze(0)

    # Tile `other` along K and N, then expand the column of K-tiles across the M dimension.
    other_arranged = other.tile((BLOCK_SIZE_K, BLOCK_SIZE_N))
    other_arranged = other_arranged.tile((-1, 1))
    other_arranged = other_arranged.expand((output_arranged.shape[0], -1))
    other_arranged.dtype = other_arranged.dtype.squeeze(1)

    return input_arranged, other_arranged, output_arranged


def application(input, other, output):
    # Accumulate partial products over the K dimension in float32.
    accumulator = ntl.zeros(output.shape, dtype=ntl.float32)
    for k in range(input.shape[0]):
        accumulator += ntl.dot(input[k], other[k])
    output = accumulator  # Write the accumulated tile back to the output.


tensors = (Tensor(2), Tensor(2), Tensor(2))
kernel = ninetoothed.make(arrangement, application, tensors)
```
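Once `kernel` has been built, it can be launched like an ordinary function, with the actual tensors passed in the same order as declared in `tensors`. Below is a minimal sketch, assuming `torch` is installed with a CUDA-capable device, that the listing above has already run, and with arbitrary illustrative shapes:

```python
import torch

# Arbitrary illustrative shapes: (M, K) @ (K, N) -> (M, N).
input = torch.randn(512, 1024, device="cuda", dtype=torch.float16)
other = torch.randn(1024, 256, device="cuda", dtype=torch.float16)
output = torch.empty(512, 256, device="cuda", dtype=torch.float16)

# Launch the generated kernel; it fills `output` in place.
kernel(input, other, output)

# Sanity-check against PyTorch's reference matmul (loose tolerances for fp16).
torch.testing.assert_close(output, input @ other, rtol=1e-2, atol=1e-2)
```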
This project is distributed under the Apache-2.0 license. See the included LICENSE file for details.