Sync 231206 #2321
Merged
Conversation
PiperOrigin-RevId: 585989541
This picks up, among other things, the fix for the invalid memcpy call in google/XNNPACK@07e1a4a. The new XNNPACK requires a newer cpuinfo, so update that too. PiperOrigin-RevId: 585992920
…tor private targets

Imported from GitHub PR openxla/xla#7323

Fixes the ROCm build, broken by openxla/xla@33fc605. @xla-rotation

Copybara import of the project:

-- ad859aa6fa0d44e2a7609eaee6bedbcd4d3968da by Chao Chen <cchen104@amd.com>:

remove command_buffer and kernel links in rocm build

Merging this change closes tensorflow#7323

PiperOrigin-RevId: 585997544
…nterface. Reimplement local collectives to utilize thread-parallelism, rather than having one thread do all the work. They are simpler this way! PiperOrigin-RevId: 585998597
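As a rough illustration of the thread-parallel approach described above (a minimal standalone sketch, not XLA's actual collectives interface; all names here are hypothetical), each participant reduces only its own shard of the buffers instead of one thread doing all the work:

```
// Hypothetical in-process all-reduce: participant `rank` reduces the
// shard [begin, end) across all inputs, so the work is split evenly
// across threads instead of being done by a single thread.
#include <cstddef>
#include <thread>
#include <vector>

void AllReduceSum(const std::vector<std::vector<float>>& inputs,
                  std::vector<std::vector<float>>& outputs) {
  const std::size_t num_ranks = inputs.size();
  const std::size_t n = inputs[0].size();
  std::vector<std::thread> workers;
  for (std::size_t rank = 0; rank < num_ranks; ++rank) {
    workers.emplace_back([&, rank] {
      const std::size_t begin = rank * n / num_ranks;
      const std::size_t end = (rank + 1) * n / num_ranks;
      for (std::size_t i = begin; i < end; ++i) {
        float sum = 0.0f;
        for (std::size_t r = 0; r < num_ranks; ++r) sum += inputs[r][i];
        // Shards are disjoint, so these writes need no synchronization.
        for (std::size_t r = 0; r < num_ranks; ++r) outputs[r][i] = sum;
      }
    });
  }
  for (std::thread& t : workers) t.join();
}
```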
triton_support - basic Triton support checks
triton_tiling_propagation - the code for propagating the tilings in a functional paradigm
triton_fusion_analysis - FusionContext and TritonFusionAnalysis
gemm_rewriter_triton - GemmRewriterTriton

PiperOrigin-RevId: 586006558
Adds parameter and return type annotations to the majority of public functions and methods in the `test_util` module. This includes annotations for methods on `TensorFlowTestCase` that return values, but omits the assertion methods. If adding types is currently infeasible (due to complexity of the signature, limitations of the supported versions of Python, type checker limitations, etc.), then this change simply does not add those annotations. PiperOrigin-RevId: 586008562
…d we can remove the compatibility support here. PiperOrigin-RevId: 586020786
https://github.com/openxla/xla/pull/7277/files PiperOrigin-RevId: 586026438
Factored out a common pattern of mutating `NodeDef`s by iterating all node defs in a `GraphDef` into a templated function, and applied it to both `enable_dump_tensor` and `change_dump_tensor_file_name`. PiperOrigin-RevId: 586028409
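A minimal sketch of what such a helper could look like (the name `MutateNodeDefs` and the usage below are hypothetical, not the actual function from this change):

```
#include "tensorflow/core/framework/graph.pb.h"

// Apply an arbitrary mutation to every NodeDef in a GraphDef.
template <typename NodeDefMutator>
void MutateNodeDefs(tensorflow::GraphDef& graph_def, NodeDefMutator mutate) {
  for (tensorflow::NodeDef& node_def : *graph_def.mutable_node()) {
    mutate(node_def);
  }
}

// Each caller then reduces to a one-line lambda, e.g.:
//   MutateNodeDefs(graph_def, [](tensorflow::NodeDef& node) {
//     if (node.op() == "DumpTensor") {
//       node.set_name(node.name() + "_renamed");  // illustrative mutation
//     }
//   });
```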
…ate* Also remove some unused #includes from the .cc files, and use the `= default` syntax for the destructor. PiperOrigin-RevId: 586069299
… data size. This is an implementation detail. PiperOrigin-RevId: 586097271
There is apparently no feasible way of resolving the TODO comment. PiperOrigin-RevId: 586106981
PiperOrigin-RevId: 586130880
This adds the configs needed for us to be able to run a Linux Arm64 GitHub presubmit on incoming PRs. It runs tests by cross-compiling test binaries on remote Linux x86 VMs using RBE and then executing the built test binaries on the host Arm64 VM. On average, this presubmit should take about 30 minutes and is ~83% faster than the current GitHub Linux Arm64 presubmit (https://github.com/tensorflow/tensorflow/actions/workflows/arm-ci.yml). I have changed the name of the cross-compile env file to include the Python version it runs and to be consistent with other env names. PiperOrigin-RevId: 586144808
Imported from GitHub PR openxla/xla#7136

This PR adds the `Allocate` command to the command buffer. The `Allocate` command is constructed with a pointer to a `BufferAllocation`. The allocation is performed when the command is recorded, and the allocated address is tracked by the command buffer runtime through the allocation index. Consumer commands that want to access the allocated buffer should provide the record parameter buffer address as an se::DeviceMemoryBase with the special address (LAZY_ALLOCATE_ADDRESS_MARKER) and a non-zero size; it can be created with the API se::DeviceMemory<>::MakeLazyAllocAddressFromByteSize(byte_length).

Below is an example of how to construct a command sequence that accesses buffers allocated inside the command buffer:

```
BufferAllocation alloc_a(/*index=*/0, byte_length, /*color=*/0);
BufferAllocation alloc_b(/*index=*/1, byte_length, /*color=*/0);
BufferAllocation alloc_c(/*index=*/2, byte_length, /*color=*/0);
BufferAllocation::Slice slice_a(&alloc_a, 0, byte_length);
BufferAllocation::Slice slice_b(&alloc_b, 0, byte_length);
BufferAllocation::Slice slice_c(&alloc_c, 0, byte_length);

// Prepare commands sequence for constructing command buffer.
CommandBufferCmdSequence commands;
commands.Emplace<AllocateCmd>(&alloc_b);
commands.Emplace<MemcpyDeviceToDeviceCmd>(slice_b, slice_a, byte_length);
commands.Emplace<MemcpyDeviceToDeviceCmd>(slice_c, slice_b, byte_length);

// Construct a thunk with command sequence.
CommandBufferThunk thunk(std::move(commands), Thunk::ThunkInfo(nullptr));

// Prepare arguments: a=42, b=0
se::DeviceMemory<int32_t> a = executor->AllocateArray<int32_t>(length, 0);
stream.ThenMemset32(&a, 42, byte_length);
se::DeviceMemory<int32_t> b =
    se::DeviceMemory<int32_t>::MakeLazyAllocAddressFromByteSize(byte_length);
se::DeviceMemory<int32_t> c = executor->AllocateArray<int32_t>(length, 0);

BufferAllocations allocations({a, b, c}, 0, executor->GetAllocator());

ServiceExecutableRunOptions run_options;
Thunk::ExecuteParams params(run_options, allocations, &stream, {});

// Execute command buffer thunk and verify that it copied the memory.
TF_ASSERT_OK(thunk.ExecuteOnStream(params));
```

For the CUDA implementation, the command has no update parameters, which means that once the command is added to a command buffer, the address range allocated for it is fixed across command buffer launches. The `Allocate` command is only implemented for the CUDA platform.

Copybara import of the project:

-- d2cdd0423fe5947e06d8d7b8d5192a8845b2beae by Shawn Wang <shawnw@nvidia.com>:

Add Allocate command to command buffer

Merging this change closes tensorflow#7136

PiperOrigin-RevId: 586150993
These are not ready yet. PiperOrigin-RevId: 586160312
PiperOrigin-RevId: 586163360
…to generate sharding strategies. For those that cannot be, we rely on pre-existing convolution handling code. PiperOrigin-RevId: 586163568
…n in a While command PiperOrigin-RevId: 586164354
…ation test is skipped for now because full support for convolution is not implemented. Refactored the target op quantization pattern matching to be compatible with dot-like ops. PiperOrigin-RevId: 586166124
PiperOrigin-RevId: 586166961
PiperOrigin-RevId: 586169003
…vides built-in utilities for saving & loading). PiperOrigin-RevId: 586169032
PiperOrigin-RevId: 586173556
Our Mac builds require some specific build environment setup such as installing Bazelisk, upgrading Pyenv, installing Python, etc. Since these scripts are meant to be run by both internal CI builds and external users, we re-work some conditional logic that was previously only meant to run for internal CI builds. It will now instead use the `TFCI_*_ENABLE` variables. This turns the conditionals from possibly confusing system checks in scripts into explicit settings in "envs" files, and allows both internal CI builds and external users to decide whether they want to enable or disable a particular macOS build environment setup task. PiperOrigin-RevId: 586173730
PiperOrigin-RevId: 586180704
PiperOrigin-RevId: 587849841
PiperOrigin-RevId: 587851943
…or_pass. PiperOrigin-RevId: 587856220
…tandalone utility PiperOrigin-RevId: 587857877
PiperOrigin-RevId: 587862987
Block and thread dimensions are already available in device kernels, so there should be no reason to add extra kernel parameters for them. For CUTLASS gemm args packing, we know the thread dimensions statically from the operation template, as sketched below. PiperOrigin-RevId: 587863262
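A small sketch of the idea (all types here are hypothetical stand-ins, not the actual CUTLASS adaptor types): the thread count is a compile-time property of the operation template, so the packing code can read it statically rather than receive it as a runtime kernel argument:

```
struct LaunchDims {
  int blocks;
  int threads;
};

// Hypothetical operation template exposing its launch shape statically,
// in the style of CUTLASS kernel traits.
struct MyGemmOp {
  static constexpr int kThreadCount = 128;
};

template <typename Gemm>
struct GemmArgsPacker {
  // Known at compile time; no extra kernel parameter needed.
  static constexpr int kThreadsPerBlock = Gemm::kThreadCount;

  static LaunchDims Dims(int problem_blocks) {
    return LaunchDims{problem_blocks, kThreadsPerBlock};
  }
};

static_assert(GemmArgsPacker<MyGemmOp>::kThreadsPerBlock == 128);
```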
PiperOrigin-RevId: 587865806
1. Deduplicate the postprocessing code for dots and convs.
2. Combine the InferInputShardingForTopK function with the GetInputSharding function, and get rid of an unused parameter in the latter.

PiperOrigin-RevId: 587868586
…a standalone utility PiperOrigin-RevId: 587869254
…ding HloValue in the producer instruction, if the producer instruction is a tuple. PiperOrigin-RevId: 587874472
PiperOrigin-RevId: 587875856
We got unlucky and hit a seed which happens to fail the KS test. PiperOrigin-RevId: 587885112
PiperOrigin-RevId: 587889341
PiperOrigin-RevId: 587893584
PiperOrigin-RevId: 587903563
…rd-swish-fusion-fp32-bf16 PiperOrigin-RevId: 587906135
…ingerprint; this contains information (like solver wall time) that can vary between runs. PiperOrigin-RevId: 587906345
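The underlying technique is simply to hash only the run-invariant fields; here is a minimal sketch under hypothetical types (not the actual auto-sharding structures):

```
#include <cstddef>
#include <functional>
#include <string>

struct SolverResult {
  std::string chosen_strategies;  // deterministic for a given input
  double solver_wall_time_sec;    // varies between runs; excluded below
};

// The fingerprint covers only fields that are stable across repeated
// runs, so identical inputs always produce identical fingerprints.
std::size_t Fingerprint(const SolverResult& result) {
  return std::hash<std::string>{}(result.chosen_strategies);
}
```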
…_sinking by doing a prepass to detect whether to construct a fusion. PiperOrigin-RevId: 587914972
…ion library.

This change fixes a rare issue where two component functions are registered on a remote eager context, and their function libraries contain a function with the same name but a different body. When this happens, the second registration fails due to a duplicate function upon adding it to the context-wide `FunctionLibraryDefinition`.

To avoid this problem, when registering a component function, we use the `FunctionDefLibrary` shipped with it to create a private `FunctionLibraryDefinition` for running that function. We can do this relatively easily because the eager `ClusterFunctionLibraryRuntime` ships all reachable functions along with the root component function, and we have long-standing support for instantiating a function with an "overlay" `FunctionLibraryDefinition`. The behavior matches the TF1 `ClusterFunctionLibraryRuntime`, which ships an entire private library as part of the subgraph it registers with a remote worker, and creates a new `FunctionLibraryDefinition` and `ProcessFunctionLibraryRuntime` for that subgraph.

Note that removing a component function via the `ClusterFunctionLibraryRuntime` was previously unsupported. We rely on this to simplify the ownership of the private `FunctionLibraryDefinition` objects, which are owned by the `EagerContext` and never deleted. Future support for removal would likely require using refcounted or otherwise-shared `FunctionLibraryDefinition` objects in the FLR stack.

(In our experience, the issue is the result of an MLIR rewrite that canonicalizes the same source function in two different ways, so e.g. the choice of retained node for common subexpression elimination is different, but the two versions are functionally equivalent. In principle, making that rewrite deterministic, or making it choose a new name for the rewritten function, would also solve the problem. However, I prefer this approach because it is robust to less-than-perfect rewrite passes, and we have a lot of rewrite passes.)

PiperOrigin-RevId: 587920362
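A greatly simplified sketch of the registration scheme (all types below are hypothetical stand-ins for the real TF classes): each component-function registration gets its own private library, so two registrations can each carry a function with the same name but different bodies without colliding:

```
#include <map>
#include <memory>
#include <string>
#include <vector>

struct FunctionDef {
  std::string name;
  std::string body;
};

struct FunctionLibrary {
  std::map<std::string, FunctionDef> functions;
  // In a single shared library, same-name/different-body insertions
  // are exactly the collisions this change avoids.
  bool Add(const FunctionDef& fdef) {
    auto [it, inserted] = functions.emplace(fdef.name, fdef);
    return inserted || it->second.body == fdef.body;
  }
};

struct EagerContextSketch {
  // One private library per registered component function, owned by
  // the context and never deleted (removal is unsupported anyway).
  std::vector<std::unique_ptr<FunctionLibrary>> private_libraries;

  FunctionLibrary* RegisterComponentFunction(
      const std::vector<FunctionDef>& shipped_functions) {
    auto lib = std::make_unique<FunctionLibrary>();
    for (const FunctionDef& fdef : shipped_functions) {
      lib->Add(fdef);  // cannot clash with other registrations' functions
    }
    private_libraries.push_back(std::move(lib));
    return private_libraries.back().get();
  }
};
```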
We need to install Bazelisk and Pyenv manually as these are not present on the x86 Mac VMs. Note that uploads from these new jobs are disabled as they are not yet ready. However, the old Mac x86 nightly builds will still run and upload to tf-nightly, so there won't be any missing nightly packages while we are doing this migration. PiperOrigin-RevId: 587930871
jayfurmanek force-pushed the sync-231206 branch from 6f8c614 to d011573 on December 6, 2023 at 18:29.
Conflicts:
  third_party/xla/xla/service/gpu/BUILD
  third_party/xla/xla/service/gpu/buffer_comparator_test.cc
  third_party/xla/xla/stream_executor/device_description.h
  third_party/xla/xla/stream_executor/rocm/hip_blas_lt.cc
  third_party/xla/xla/stream_executor/rocm/hip_blas_lt.h
  third_party/xla/xla/tests/BUILD
…nDef()`. Some compilers do not like using the name of a class as a method, which is fair enough. PiperOrigin-RevId: 588312567
draganmladjenovic force-pushed the sync-231206 branch from 2620692 to d24bced on December 12, 2023 at 19:19.
Retest Ubuntu-GPU-single please.
Retest gpu-pycpp please.
draganmladjenovic approved these changes on Dec 15, 2023.