Triton metaprogramming: how to parameterize a kernel with constexpr callable function? #7772

PgLoLo · 2024-11-07T10:48:58Z

Hello,

I am struggling to find a way to parameterize a Triton kernel with a constexpr callable. For instance, I have a kernel that loads two tensors, applies an element-wise binary operation, and stores the result. I have several predefined binary operations, and I’d like to avoid duplicating the kernel for each one. However, passing the operation as a tl.constexpr argument, as in the code below, fails under torch.compile with InternalTorchDynamoError: NotImplementedError: TritonKernelVariable():

import torch as t
import triton
from triton import language as tl


@triton.jit
def plus(x1, x2):
    return x1 + x2


@triton.jit
def minus(x1, x2):
    return x1 - x2


@triton.jit
def kernel(
    in_ptr1,
    in_ptr2,
    out_ptr,
    binary_op: tl.constexpr,
    numel: tl.constexpr,
    BLOCK_SIZE: tl.constexpr
):
    offset = tl.program_id(0) * BLOCK_SIZE
    index = offset + tl.arange(0, BLOCK_SIZE)
    mask = index < numel
    
    x1 = tl.load(in_ptr1 + index, mask, other=0.0)
    x2 = tl.load(in_ptr2 + index, mask, other=0.0)
    y = binary_op(x1, x2)  # call the operation passed in as a constexpr argument
    
    tl.store(out_ptr + index, y, mask)


@t.compile()
def function(x1: t.Tensor, x2: t.Tensor) -> t.Tensor:
    y = t.empty_like(x1)
    numel = x1.numel()
    grid = lambda meta: (triton.cdiv(numel, meta['BLOCK_SIZE']),)
    kernel[grid](x1, x2, y, plus, numel, BLOCK_SIZE=512)
    return y


x1 = t.ones(10)
x2 = (-t.ones(10))**t.arange(10)
function(x1.cuda(), x2.cuda())

The error is understandable here, but I don’t see why a compose-on-the-fly approach doesn’t work either:

import torch as t
import triton
from triton import language as tl


@triton.jit
def plus(x1, x2):
    return x1 + x2


@triton.jit
def minus(x1, x2):
    return x1 - x2


def compose(binary_op):
    @triton.jit
    def kernel(
        in_ptr1,
        in_ptr2,
        out_ptr,
        numel: tl.constexpr,
        BLOCK_SIZE: tl.constexpr
    ):
        offset = tl.program_id(0) * BLOCK_SIZE
        index = offset + tl.arange(0, BLOCK_SIZE)
        mask = index < numel
        
        x1 = tl.load(in_ptr1 + index, mask, other=0.0)
        x2 = tl.load(in_ptr2 + index, mask, other=0.0)
        y = binary_op(x1, x2)
        
        tl.store(out_ptr + index, y, mask)
        
    @t.compile()
    def function(x1: t.Tensor, x2: t.Tensor) -> t.Tensor:
        y = t.empty_like(x1)
        numel = x1.numel()
        grid = lambda meta: (triton.cdiv(numel, meta['BLOCK_SIZE']),)
        kernel[grid](x1, x2, y, numel, BLOCK_SIZE=512)
        return y

    return function


function = compose(plus)

x1 = t.ones(10)
x2 = (-t.ones(10))**t.arange(10)
function(x1.cuda(), x2.cuda())

In this approach, we define kernel code specifically for the given binary_op function, but it still results in an error:

NameError('binary_op is not defined')

(The amusing part is that it’s just a warning, and the code executes something, but the output tensor contains garbage values.)
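One possible explanation for the NameError, offered as an assumption rather than a confirmed diagnosis: @triton.jit compiles the kernel from its source text, and if the Triton version in use resolves names only through the function's module globals, a variable captured from an enclosing function is invisible to it. The plain-Python sketch below needs neither Triton nor a GPU, and lookup_in_globals is a hypothetical helper, not a Triton API. It only shows where a captured name such as binary_op lives:

```python
import inspect


def plus(a, b):
    return a + b


def compose(binary_op):
    # `binary_op` is a free variable of `kernel`: it is stored in the
    # closure, not in the module globals.
    def kernel(x1, x2):
        return binary_op(x1, x2)

    return kernel


def lookup_in_globals(fn, name):
    # Hypothetical helper: mimics a compiler that resolves names through
    # the function's module globals only and ignores its closure.
    try:
        return fn.__globals__[name]
    except KeyError:
        raise NameError(f"{name} is not defined") from None


kernel = compose(plus)

# The captured name is recorded as a free (closure) variable...
print(kernel.__code__.co_freevars)        # ('binary_op',)
# ...and is absent from the module globals.
print("binary_op" in kernel.__globals__)  # False
# inspect.getclosurevars is where the captured value can be found.
print(inspect.getclosurevars(kernel).nonlocals["binary_op"] is plus)  # True
```

If that assumption holds, a kernel that refers to the operation through a module-level name would not hit this lookup failure. This has not been checked against specific Triton versions.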

Am I missing something, or is it currently impossible to metaprogram Triton kernels in order to eliminate repetitive code?
