
Efficient implementation of Tensor::ones() for metal #2512

Merged (9 commits) on Oct 1, 2024

Conversation

@AnubhabB (Contributor) commented Sep 28, 2024

While working with Tensor::ones() I realized the implementation is significantly slower than its PyTorch counterpart on Metal. The original implementation creates a CpuStorage and then moves it to the GPU.

This PR introduces a Metal kernel for the Tensor::ones() implementation.

Benchmarks (original vs. new implementation)

Benchmark code run on a Mac M1 Pro, 16 GB

| Impl | DType | Low | Mid | High | Throughput (GiB/s) |
| --- | --- | --- | --- | --- | --- |
| Original (cpu) | u8 | 36.752 µs | 39.081 µs | 40.932 µs | 24.988 |
| Original (metal) | u8 | 226.73 µs | 230.18 µs | 233.76 µs | 4.2427 |
| New Impl (metal) | u8 | 14.773 µs | 14.877 µs | 14.970 µs | 4.2427 |
| Original (cpu) | u32 | 349.63 µs | 352.22 µs | 355.47 µs | 11.090 |
| Original (metal) | u32 | 1.2362 ms | 1.2555 ms | 1.2761 ms | 3.1112 |
| New Impl (metal) | u32 | 25.443 µs | 25.515 µs | 25.572 µs | 153.09 |
| Original (cpu) | f32 | 351.63 µs | 355.10 µs | 359.24 µs | 11.001 |
| Original (metal) | f32 | 1.2358 ms | 1.2552 ms | 1.2800 ms | 3.1121 |
| New Impl (metal) | f32 | 24.859 µs | 24.955 µs | 25.039 µs | 156.53 |
| Original (cpu) | bf16 | 179.12 µs | 181.66 µs | 184.41 µs | 10.752 |
| Original (metal) | bf16 | 634.81 µs | 650.41 µs | 667.20 µs | 3.0029 |
| New Impl (metal) | bf16 | 16.683 µs | 16.782 µs | 16.871 µs | 116.38 |
| Original (cpu) | f16 | 175.30 µs | 177.45 µs | 180.00 µs | 11.007 |
| Original (metal) | f16 | 636.60 µs | 648.13 µs | 662.07 µs | 3.0135 |
| New Impl (metal) | f16 | 16.710 µs | 16.767 µs | 16.818 µs | 116.49 |
| Original (cpu) | i64 | 703.17 µs | 708.68 µs | 714.85 µs | 11.024 |
| Original (metal) | i64 | 2.2823 ms | 2.3245 ms | 2.3691 ms | 3.3609 |
| New Impl (metal) | i64 | 42.224 µs | 42.342 µs | 42.437 µs | 184.51 |

Looks like we are getting a significant performance boost.

Implementation details:

  • MetalDevice::ones_impl now introduces a code path similar to the random-number implementation and falls back to the CpuStorage-based implementation for DType::F64
  • Introduces a fill kernel, following the pattern used by the existing Metal kernels
  • Expanded the candle-core test cases to cover the bf16 and f16 types
  • Added a test case to candle-metal-kernels for the newly introduced constant_fill code

Please let me know if I'm missing something!
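For readers unfamiliar with the pattern, a constant-fill kernel boils down to one guarded store per thread. A minimal illustrative sketch of what such a kernel might look like (the names, buffer indices, and types here are hypothetical, not the merged fill.metal):

```metal
// Hypothetical sketch of a templated constant-fill kernel.
template <typename T>
kernel void fill(device T *out          [[buffer(0)]],
                 constant T &value      [[buffer(1)]],
                 constant size_t &numel [[buffer(2)]],
                 uint tid [[thread_position_in_grid]]) {
    // Guard: the last threadgroup may extend past numel.
    if (tid >= numel) {
        return;
    }
    out[tid] = value;
}
```

Since Metal does not allow dispatching a templated kernel directly, each dtype needs an explicit instantiation with a [[host_name("...")]] attribute so the host side can look the variant up by name.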


```rust
encoder.use_resource(output, metal::MTLResourceUsage::Write);

let grid_size = MTLSize { width: length as u64, height: 1, depth: 1 };
```
Collaborator:

Is length correct here? Shouldn't it be divided by the number of threads? Anyway, it's better to use utils::linear_split as it should do the proper thing here.
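For context, a linear_split-style helper amounts to capping the threadgroup width at the pipeline's thread limit and ceiling-dividing the length by it. A minimal sketch (the actual candle-metal-kernels helper takes a ComputePipelineState and returns MTLSize values; the plain-integer signature below is a simplification for illustration):

```rust
/// Split a 1-D problem of `length` elements into threadgroups, given the
/// pipeline's maximum threads per threadgroup.
/// Returns (threadgroup_count, threads_per_threadgroup).
fn linear_split(length: u64, max_threads_per_group: u64) -> (u64, u64) {
    // Threadgroup width: never wider than the problem itself.
    let group_width = max_threads_per_group.min(length);
    // Ceiling division so every element is covered by some thread.
    let group_count = (length + group_width - 1) / group_width;
    (group_count, group_width)
}

fn main() {
    // 1_000_000 elements with a 1024-thread cap -> 977 groups of 1024.
    assert_eq!(linear_split(1_000_000, 1024), (977, 1024));
    // Problems smaller than the cap fit in a single, exactly-sized group.
    assert_eq!(linear_split(100, 1024), (1, 100));
    println!("ok");
}
```

Dispatching `length` threadgroups of default width, as in the original line, launches far more threads than elements, which is why the reviewer flags it.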

Contributor Author:

This was stupid of me. Resolved.

candle-metal-kernels/src/fill.metal (comment thread resolved)
@AnubhabB (Contributor Author) commented Oct 1, 2024

@LaurentMazare rustfmt is failing. Should I run `cargo fmt --all` and push the changes?

@LaurentMazare (Collaborator):

Yes please.

@LaurentMazare LaurentMazare merged commit a2bcc22 into huggingface:main Oct 1, 2024
8 of 10 checks passed
@LaurentMazare (Collaborator):

Thanks!

@AnubhabB AnubhabB deleted the ones-impl branch October 15, 2024 08:47