How to output NVPTX assembly/IR/bytecode? #6410

Closed
mcourteaux opened this issue Nov 12, 2021 · 5 comments

Comments

@mcourteaux
Contributor

I'm looking for a way to inspect the NVPTX code generated for my pipelines. The statement files only contain calls to halide_cuda_run() that launch the kernels; I am looking for the kernel code itself. So far I have found that if you use c_source as a generator output, it produces a very long C file with the kernel code buried somewhere inside it. However, this fails when the CPU-scheduled part of my pipeline uses something the C backend does not support (like predicated loads). I think generators are missing an output type that just emits the CUDA kernel assembly.

An example of kernel code hiding in the C-source is this:

static const char *_cuda_gpu_source_kernels_string = R"BUFCHARSOURCE(//
// Generated by LLVM NVPTX Back-End
//

.version 3.2
.target sm_20
.address_size 64

	// .globl	_kernel_logit_kernel_offset_s0_kernel_ko___block_id_x // -- Begin function _kernel_logit_kernel_offset_s0_kernel_ko___block_id_x
                                        // @_kernel_logit_kernel_offset_s0_kernel_ko___block_id_x
.visible .entry _kernel_logit_kernel_offset_s0_kernel_ko___block_id_x(
	.param .u64 _kernel_logit_kernel_offset_s0_kernel_ko___block_id_x_param_0,
	.param .u64 _kernel_logit_kernel_offset_s0_kernel_ko___block_id_x_param_1,
	.param .u64 _kernel_logit_kernel_offset_s0_kernel_ko___block_id_x_param_2,
	.param .u32 _kernel_logit_kernel_offset_s0_kernel_ko___block_id_x_param_3,
	.param .u32 _kernel_logit_kernel_offset_s0_kernel_ko___block_id_x_param_4,
	.param .u32 _kernel_logit_kernel_offset_s0_kernel_ko___block_id_x_param_5,
	.param .u32 _kernel_logit_kernel_offset_s0_kernel_ko___block_id_x_param_6
)
{
	.reg .pred 	%p<6>;
	.reg .f32 	%f<43>;
	.reg .b32 	%r<33>;
	.reg .b64 	%rd<21>;

// %bb.0:                               // %entry
	ld.param.u64 	%rd5, [_kernel_logit_kernel_offset_s0_kernel_ko___block_id_x_param_0];
	ld.param.u64 	%rd6, [_kernel_logit_kernel_offset_s0_kernel_ko___block_id_x_param_2];
	cvta.to.global.u64 	%rd1, %rd6;
	ld.param.u64 	%rd7, [_kernel_logit_kernel_offset_s0_kernel_ko___block_id_x_param_1];
	cvta.to.global.u64 	%rd2, %rd7;
	cvta.to.global.u64 	%rd3, %rd5;
	mov.u32 	%r1, %ctaid.x;
	ld.param.u32 	%r9, [_kernel_logit_kernel_offset_s0_kernel_ko___block_id_x_param_4];
	mov.u32 	%r2, %tid.x;
	setp.lt.s32 	%p1, %r1, %r9;
	@%p1 bra 	LBB0_1;
	bra.uni 	LBB0_2;
LBB0_1:                                 // %then_bb
	ld.param.u32 	%r7, [_kernel_logit_kernel_offset_s0_kernel_ko___block_id_x_param_5];
	shl.b32 	%r20, %r1, 5;

[... cutting because not relevant...]

	add.s32 	%r32, %r2, %r8;
	mov.u64 	%rd20, -32;
LBB0_3:                                 // %after_bb
	mul.wide.s32 	%rd16, %r32, 4;
	add.s64 	%rd17, %rd2, %rd16;
	shl.b64 	%rd18, %rd20, 2;
	add.s64 	%rd19, %rd17, %rd18;
	st.global.f32 	[%rd19], %f42;
	ret;
                                        // -- End function
}
)BUFCHARSOURCE";

What is really missing right now is a way to write all kernels to one file, without the C code around them.
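As a stopgap, the embedded kernel text can be pulled back out of the generated C file with a small script. This is only a minimal sketch in Python: it assumes the GPU source is always stored as a C++ raw string literal delimited by R"BUFCHARSOURCE( and )BUFCHARSOURCE", exactly as in the excerpt above, and the helper name extract_gpu_sources is hypothetical, not part of Halide.

```python
import re
from typing import Dict

# Matches: <name> = R"BUFCHARSOURCE(<body>)BUFCHARSOURCE"
# The delimiter is an assumption taken from the excerpt in this issue;
# other Halide versions may use a different one.
_RAW_STRING = re.compile(
    r'(\w+)\s*=\s*R"BUFCHARSOURCE\((.*?)\)BUFCHARSOURCE"',
    re.DOTALL,
)


def extract_gpu_sources(c_source: str) -> Dict[str, str]:
    """Return {variable_name: embedded_text} for every raw string found."""
    return {m.group(1): m.group(2) for m in _RAW_STRING.finditer(c_source)}
```

Reading the generated file into a string and passing it to extract_gpu_sources gives one entry per embedded buffer, which can then be written to a .ptx file. It does not help when the C backend fails to compile the pipeline in the first place.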

If this is desirable, it is definitely something I could contribute, so feel free to give me some pointers on how to approach it, and I'll make a PR.

@mcourteaux
Contributor Author

Still, this code is hard to interpret. I'd much rather see something along the lines of a statement file for GPU code (like we have stmt files for CPU code).

@abadams
Member

abadams commented Nov 22, 2021

Setting the environment variable HL_DEBUG_CODEGEN=1 causes Halide to print the PTX. If you set it to 2 and ptxas is in the path, it also attempts to print the SASS.
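For scripted use, the variable can be set for a single run of whatever process performs the Halide compilation. The following is a rough Python sketch, not an official workflow: the helper name run_with_debug_codegen is hypothetical, and because it is an assumption which stream the debug output goes to, stdout and stderr are captured together.

```python
import os
import subprocess
from typing import List


def run_with_debug_codegen(cmd: List[str], level: int = 1) -> str:
    """Run cmd with HL_DEBUG_CODEGEN set and return its combined output."""
    env = dict(os.environ)                  # copy, so the caller's env is untouched
    env["HL_DEBUG_CODEGEN"] = str(level)    # 1: PTX; 2: also SASS if ptxas is on PATH
    result = subprocess.run(
        cmd,
        env=env,
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,           # merge both streams into one log
        text=True,
        check=True,
    )
    return result.stdout

# Hypothetical usage; the generator binary and pipeline name are placeholders:
# log = run_with_debug_codegen(["./my_generator", "-g", "my_pipeline",
#                               "-o", ".", "target=host-cuda"])
```

The returned log contains all of the debug output, so the PTX still has to be located within it by hand or with a text search.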

@abadams
Member

abadams commented Nov 22, 2021

Being able to emit the PTX source in some way other than debug output would be better, and is on the TODO list: #5055

@mcourteaux
Contributor Author

I'm currently testing an approach that simply puts the buffer from the Module into the Stmt file if its name ends with "_gpu_source_kernels". I'm not sure it's a great idea, but at least it sits behind a collapsible button.

@abadams
Member

abadams commented Nov 24, 2021

Fixed by #6444
