
Efficient implementation of Tensor::ones() for metal #2512

Merged (9 commits) on Oct 1, 2024

Conversation

@AnubhabB (Contributor) commented Sep 28, 2024

While working with Tensor::ones() I realized the implementation is significantly slower than its PyTorch counterpart on Metal. The original implementation creates a CpuStorage and then moves it to the GPU.

This PR introduces a Metal kernel for the Tensor::ones() implementation.

Benchmarks (original vs. new implementation)

Benchmark code run on a Mac M1 Pro, 16 GB

| Impl | DType | Low | Mid | High | Throughput (GiB/s) |
| --- | --- | --- | --- | --- | --- |
| Original (cpu) | u8 | 36.752 µs | 39.081 µs | 40.932 µs | 24.988 |
| Original (metal) | u8 | 226.73 µs | 230.18 µs | 233.76 µs | 4.2427 |
| New Impl (metal) | u8 | 14.773 µs | 14.877 µs | 14.970 µs | 4.2427 |
| Original (cpu) | u32 | 349.63 µs | 352.22 µs | 355.47 µs | 11.090 |
| Original (metal) | u32 | 1.2362 ms | 1.2555 ms | 1.2761 ms | 3.1112 |
| New Impl (metal) | u32 | 25.443 µs | 25.515 µs | 25.572 µs | 153.09 |
| Original (cpu) | f32 | 351.63 µs | 355.10 µs | 359.24 µs | 11.001 |
| Original (metal) | f32 | 1.2358 ms | 1.2552 ms | 1.2800 ms | 3.1121 |
| New Impl (metal) | f32 | 24.859 µs | 24.955 µs | 25.039 µs | 156.53 |
| Original (cpu) | bf16 | 179.12 µs | 181.66 µs | 184.41 µs | 10.752 |
| Original (metal) | bf16 | 634.81 µs | 650.41 µs | 667.20 µs | 3.0029 |
| New Impl (metal) | bf16 | 16.683 µs | 16.782 µs | 16.871 µs | 116.38 |
| Original (cpu) | f16 | 175.30 µs | 177.45 µs | 180.00 µs | 11.007 |
| Original (metal) | f16 | 636.60 µs | 648.13 µs | 662.07 µs | 3.0135 |
| New Impl (metal) | f16 | 16.710 µs | 16.767 µs | 16.818 µs | 116.49 |
| Original (cpu) | i64 | 703.17 µs | 708.68 µs | 714.85 µs | 11.024 |
| Original (metal) | i64 | 2.2823 ms | 2.3245 ms | 2.3691 ms | 3.3609 |
| New Impl (metal) | i64 | 42.224 µs | 42.342 µs | 42.437 µs | 184.51 |

Looks like we are getting a significant performance boost.

Implementation details:

  • MetalDevice::ones_impl now introduces a code path similar to the random-number implementation and falls back to the CpuStorage-based implementation for DType::F64
  • Introduces a fill kernel, following the pattern used by the existing Metal kernels
  • Expanded the candle-core test cases to cover the bf16 and f16 types
  • Added a test case to candle-metal-kernels for the newly introduced constant_fill code

Please let me know if I'm missing something!
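For readers unfamiliar with the pattern, a constant-fill kernel boils down to one guarded store per thread. A minimal illustrative sketch of what such a kernel might look like (the names, buffer indices, and types here are hypothetical, not the merged fill.metal):

```metal
// Hypothetical sketch of a templated constant-fill kernel.
template <typename T>
kernel void fill(device T *out          [[buffer(0)]],
                 constant T &value      [[buffer(1)]],
                 constant size_t &numel [[buffer(2)]],
                 uint tid [[thread_position_in_grid]]) {
    // Guard: the last threadgroup may extend past numel.
    if (tid >= numel) {
        return;
    }
    out[tid] = value;
}
```

Since Metal does not allow dispatching a templated kernel directly, each dtype needs an explicit instantiation with a [[host_name("...")]] attribute so the host side can look the variant up by name.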


```rust
encoder.use_resource(output, metal::MTLResourceUsage::Write);

let grid_size = MTLSize { width: length as u64, height: 1, depth: 1 };
```
Collaborator:

Is length correct here? Shouldn't it be divided by the number of threads? Anyway, it's better to use utils::linear_split as it should do the proper thing here.
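For context, a linear_split-style helper amounts to capping the threadgroup width at the pipeline's thread limit and ceiling-dividing the length by it. A minimal sketch (the actual candle-metal-kernels helper takes a ComputePipelineState and returns MTLSize values; the plain-integer signature below is a simplification for illustration):

```rust
/// Split a 1-D problem of `length` elements into threadgroups, given the
/// pipeline's maximum threads per threadgroup.
/// Returns (threadgroup_count, threads_per_threadgroup).
fn linear_split(length: u64, max_threads_per_group: u64) -> (u64, u64) {
    // Threadgroup width: never wider than the problem itself.
    let group_width = max_threads_per_group.min(length);
    // Ceiling division so every element is covered by some thread.
    let group_count = (length + group_width - 1) / group_width;
    (group_count, group_width)
}

fn main() {
    // 1_000_000 elements with a 1024-thread cap -> 977 groups of 1024.
    assert_eq!(linear_split(1_000_000, 1024), (977, 1024));
    // Problems smaller than the cap fit in a single, exactly-sized group.
    assert_eq!(linear_split(100, 1024), (1, 100));
    println!("ok");
}
```

Dispatching `length` threadgroups of default width, as in the original line, launches far more threads than elements, which is why the reviewer flags it.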

Contributor Author:

This was stupid of me. Resolved.

candle-metal-kernels/src/fill.metal (comment thread resolved)
@AnubhabB (Contributor Author) commented Oct 1, 2024

@LaurentMazare rustfmt is failing. Should I run `cargo fmt --all` and push the changes?

@LaurentMazare (Collaborator):

Yes please.

@LaurentMazare LaurentMazare merged commit a2bcc22 into huggingface:main Oct 1, 2024
8 of 10 checks passed
@LaurentMazare (Collaborator):

Thanks!

@AnubhabB AnubhabB deleted the ones-impl branch October 15, 2024 08:47