
Sync 231206 #2321

Merged
merged 397 commits into from
Dec 15, 2023

Conversation

jayfurmanek

No description provided.

mihaimaruseac and others added 30 commits November 28, 2023 07:53
This picks up, among other things, the fix for the
invalid memcpy call in google/XNNPACK@07e1a4a

The new XNNPACK requires a new cpuinfo, so update that too.

PiperOrigin-RevId: 585992920
…tor private targets

Imported from GitHub PR openxla/xla#7323

Fixed the ROCm build broken by openxla/xla@33fc605

@xla-rotation
Copybara import of the project:

--
ad859aa6fa0d44e2a7609eaee6bedbcd4d3968da by Chao Chen <cchen104@amd.com>:

remove command_buffer and kernel links in rocm build

Merging this change closes tensorflow#7323

PiperOrigin-RevId: 585997544
…nterface.

Reimplement local collectives to utilize thread-parallelism, rather than having one thread do all the work. They are simpler this way!

PiperOrigin-RevId: 585998597
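
A rough illustration of the idea (not the actual XLA collectives code; the names and structure here are hypothetical): each thread reduces its own slice of the participants' buffers instead of one thread doing all the work.

```
// Minimal sketch: a local all-reduce (sum) where each worker thread
// handles a disjoint element range of every participant's buffer.
#include <algorithm>
#include <cstddef>
#include <thread>
#include <vector>

void LocalAllReduceSum(std::vector<float*>& participant_buffers,
                       size_t num_elements, size_t num_threads) {
  std::vector<std::thread> workers;
  const size_t chunk = (num_elements + num_threads - 1) / num_threads;
  for (size_t t = 0; t < num_threads; ++t) {
    workers.emplace_back([&, t] {
      const size_t begin = t * chunk;
      const size_t end = std::min(num_elements, begin + chunk);
      for (size_t i = begin; i < end; ++i) {
        float sum = 0.0f;
        for (float* buf : participant_buffers) sum += buf[i];
        for (float* buf : participant_buffers) buf[i] = sum;
      }
    });
  }
  for (std::thread& w : workers) w.join();
}
```
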
triton_support - basic Triton support checks
triton_tiling_propagation - the code for propagating the tilings in a functional paradigm
triton_fusion_analysis - FusionContext and TritonFusionAnalysis
gemm_rewriter_triton - GemmRewriterTriton

PiperOrigin-RevId: 586006558
Adds parameter and return type annotations to the majority of public functions
and methods in the `test_util` module. This includes annotations for methods on
`TensorFlowTestCase` which return values, but omits the assertion methods.

If adding types is currently infeasible (due to the complexity of the signature, limitations of the supported Python versions, type-checker limitations, etc.), then this change simply does not add those annotations.

PiperOrigin-RevId: 586008562
…d we can remove the compatibility support here.

PiperOrigin-RevId: 586020786
Factored out a common pattern of mutating `NodeDef`s by iterating all node defs in a `GraphDef` into a templated function and applied it for both `enable_dump_tensor` and `change_dump_tensor_file_name`.

PiperOrigin-RevId: 586028409
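
As a sketch of the pattern described above (the real helper's name and signature may differ), a templated function that applies a mutation to every `NodeDef` in a `GraphDef` might look like:

```
#include "tensorflow/core/framework/graph.pb.h"
#include "tensorflow/core/framework/node_def.pb.h"

// Hypothetical helper: applies `mutate` to every NodeDef in the graph.
template <typename NodeDefMutator>
void MutateNodeDefs(tensorflow::GraphDef& graph_def, NodeDefMutator&& mutate) {
  for (tensorflow::NodeDef& node_def : *graph_def.mutable_node()) {
    mutate(node_def);
  }
}

// Example use (illustrative only), e.g. to edit attributes of matching nodes:
// MutateNodeDefs(graph_def, [](tensorflow::NodeDef& node) {
//   if (node.op() == "DumpTensor") { /* edit node attrs here */ }
// });
```
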
…ate*

Also remove some unused #includes from the .cc files.
Also use "= default" syntax for destructor.

PiperOrigin-RevId: 586069299
… data size.

This is an implementation detail.

PiperOrigin-RevId: 586097271
There is apparently no feasible way of resolving the TODO comment.

PiperOrigin-RevId: 586106981
PiperOrigin-RevId: 586130880
This adds the configs needed for us to be able to run a Linux Arm64 GitHub presubmit on incoming PRs. It runs tests by cross-compiling test binaries on remote Linux x86 VMs using RBE and then executing the built test binaries on the host Arm64 VM.

On average, this presubmit should take ~30 minutes, which is ~83% faster than the current GitHub Linux Arm64 presubmit (https://github.com/tensorflow/tensorflow/actions/workflows/arm-ci.yml).

I have changed the name of the cross-compile env file to add the Python version it runs and to be consistent with other env names.

PiperOrigin-RevId: 586144808
Imported from GitHub PR openxla/xla#7136

This PR adds the `Allocate` command to the command buffer.

The `Allocate` command is constructed with a pointer to a `BufferAllocation`. The allocation is performed when the command is recorded, and the allocated address is tracked by the command buffer runtime through the allocation index. Consumer commands that want to access the allocated buffer should provide the record-parameter buffer address as an se::DeviceMemoryBase with the special address (LAZY_ALLOCATE_ADDRESS_MARKER) and a non-zero size; such an address can be created with the API se::DeviceMemory<>::MakeLazyAllocAddressFromByteSize(byte_length).

Below is an example of how to construct a command sequence that accesses buffers allocated inside the command buffer:

```
  BufferAllocation alloc_a(/*index=*/0, byte_length, /*color=*/0);
  BufferAllocation alloc_b(/*index=*/1, byte_length, /*color=*/0);
  BufferAllocation alloc_c(/*index=*/2, byte_length, /*color=*/0);
  BufferAllocation::Slice slice_a(&alloc_a, 0, byte_length);
  BufferAllocation::Slice slice_b(&alloc_b, 0, byte_length);
  BufferAllocation::Slice slice_c(&alloc_c, 0, byte_length);

  // Prepare the command sequence for constructing the command buffer.
  CommandBufferCmdSequence commands;
  commands.Emplace<AllocateCmd>(&alloc_b);
  commands.Emplace<MemcpyDeviceToDeviceCmd>(slice_b, slice_a, byte_length);
  commands.Emplace<MemcpyDeviceToDeviceCmd>(slice_c, slice_b, byte_length);

  // Construct a thunk with command sequence.
  CommandBufferThunk thunk(std::move(commands), Thunk::ThunkInfo(nullptr));

  // Prepare arguments: a=42, b=0
  se::DeviceMemory<int32_t> a = executor->AllocateArray<int32_t>(length, 0);
  stream.ThenMemset32(&a, 42, byte_length);

  se::DeviceMemory<int32_t> b = se::DeviceMemory<int32_t>::MakeLazyAllocAddressFromByteSize(byte_length);
  se::DeviceMemory<int32_t> c = executor->AllocateArray<int32_t>(length, 0);
  BufferAllocations allocations({a, b, c}, 0, executor->GetAllocator());

  ServiceExecutableRunOptions run_options;
  Thunk::ExecuteParams params(run_options, allocations, &stream, {});

  // Execute command buffer thunk and verify that it copied the memory.
  TF_ASSERT_OK(thunk.ExecuteOnStream(params));

```

For the CUDA implementation, the command has no update parameters, which means that once the command is added to a command buffer, the address range allocated for it is fixed across command buffer launches.

The `Allocate` command is currently only implemented for the CUDA platform.
Copybara import of the project:

--
d2cdd0423fe5947e06d8d7b8d5192a8845b2beae by Shawn Wang <shawnw@nvidia.com>:

Add Allocate command to command buffer

Merging this change closes tensorflow#7136

PiperOrigin-RevId: 586150993
These are not ready yet.

PiperOrigin-RevId: 586160312
PiperOrigin-RevId: 586163360
…to generate sharding strategies. For those that cannot be, we rely on pre-existing convolution handling code.

PiperOrigin-RevId: 586163568
…n in a While command

PiperOrigin-RevId: 586164354
…ation test is skipped for now because the full support for convolution is not implemented.

Refactored the target op quantization pattern matching to be compatible with dot-like ops.

PiperOrigin-RevId: 586166124
…vides built-in utilities for saving & loading).

PiperOrigin-RevId: 586169032
Our Mac builds require some specific build-environment setup, such as installing Bazelisk, upgrading Pyenv, and installing Python. Since these scripts are meant to be run by both internal CI builds and external users, we rework some conditional logic that was previously only meant to run for internal CI builds. These conditionals now use the `TFCI_*_ENABLE` variables instead. This turns what were possibly confusing system checks in scripts into explicit settings in "envs" files, and allows both internal CI builds and external users to decide whether to enable or disable a particular macOS build-environment setup task.

PiperOrigin-RevId: 586173730
tyb0807 and others added 20 commits December 4, 2023 15:15
PiperOrigin-RevId: 587849841
PiperOrigin-RevId: 587851943
…tandalone utility

PiperOrigin-RevId: 587857877
Block and thread dimensions are already available in device kernels, so there should be no reason to add extra kernel parameters for them. For CUTLASS gemm argument packing we know the thread dimensions statically from an operation template.

PiperOrigin-RevId: 587863262
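
For illustration, a generic CUDA sketch (not the CUTLASS kernel in question): the launch dimensions are visible inside the kernel through built-in variables, so they do not need to be passed as arguments.

```
// Generic CUDA sketch: block/thread dimensions come from built-in variables,
// so there is no need to pass them as extra kernel parameters.
__global__ void ScaleKernel(float* data, int n, float factor) {
  int idx = blockIdx.x * blockDim.x + threadIdx.x;  // built-ins, not params
  int stride = gridDim.x * blockDim.x;              // grid-stride loop
  for (int i = idx; i < n; i += stride) {
    data[i] *= factor;
  }
}
```
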
PiperOrigin-RevId: 587865806
1. Deduplicate the postprocessing code for dots and convs.
2. Combine the InferInputShardingForTopK function with the GetInputSharding function, and get rid of an unused parameter in the latter.

PiperOrigin-RevId: 587868586
…a standalone utility

PiperOrigin-RevId: 587869254
…ding HloValue in the producer instruction, if the producer instruction is a tuple.

PiperOrigin-RevId: 587874472
We got unlucky and hit a seed which happens to fail the KS test.

PiperOrigin-RevId: 587885112
…rd-swish-fusion-fp32-bf16

PiperOrigin-RevId: 587906135
…ingerprint; this contains information (like solver wall time) that can vary between runs.

PiperOrigin-RevId: 587906345
…_sinking

by doing a prepass to detect whether to construct a fusion.

PiperOrigin-RevId: 587914972
…ion library.

This change fixes a rare issue where two component functions are registered on a remote eager context and their function libraries contain a function with the same name but a different body. When this happens, the second registration fails due to a duplicate function upon adding it to the context-wide `FunctionLibraryDefinition`.

To avoid this problem, when registering a component function, we use the shipped `FunctionDefLibrary` to create a private `FunctionLibraryDefinition` for running that function. We can do this relatively easily because the eager `ClusterFunctionLibraryRuntime` ships all reachable functions along with the root component function, and we have long-standing support for instantiating a function with an "overlay" `FunctionLibraryDefinition`.

The behavior matches the TF1 `ClusterFunctionLibraryRuntime`, which ships an entire private library as part of the subgraph it registers with a remote worker, and creates a new `FunctionLibraryDefinition` and `ProcessFunctionLibraryRuntime` for that subgraph.

Note that removing a component function via the `ClusterFunctionLibraryRuntime` was already unsupported. We rely on this to simplify the ownership of the private `FunctionLibraryDefinition` objects, which are owned by the `EagerContext` and never deleted. Future support for removal would likely require using refcounted or otherwise shared `FunctionLibraryDefinition` objects in the FLR stack.

(In our experience, the issue is the result of an MLIR rewrite that canonicalizes the same source function in two different ways, so e.g. the choice of retained node for common subexpression elimination is different, but the two versions are functionally equivalent. In principle, making that rewrite deterministic, or making it choose a new name for the rewritten function would also solve the problem. However, I prefer this approach because it is robust to less-than-perfect rewrite passes, and we have a lot of rewrite passes.)

PiperOrigin-RevId: 587920362
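
A minimal sketch of the overlay mechanism described above (the actual registration code path differs; this only shows the two long-standing TF APIs involved): build a private `FunctionLibraryDefinition` from the shipped `FunctionDefLibrary`, then pass it as the overlay library when instantiating the component function.

```
#include <memory>

#include "tensorflow/core/framework/function.h"
#include "tensorflow/core/framework/op.h"

// Sketch: build a private FunctionLibraryDefinition from the FunctionDefLibrary
// shipped with the component function, so name clashes with the context-wide
// library no longer matter.
std::unique_ptr<tensorflow::FunctionLibraryDefinition> MakePrivateLibDef(
    const tensorflow::FunctionDefLibrary& shipped_lib) {
  return std::make_unique<tensorflow::FunctionLibraryDefinition>(
      tensorflow::OpRegistry::Global(), shipped_lib);
}

// At instantiation time (illustrative only):
//   tensorflow::FunctionLibraryRuntime::InstantiateOptions opts;
//   opts.lib_def = private_lib_def.get();   // the overlay library
//   flr->Instantiate(function_name, attrs, opts, &handle);
```
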
We need to install Bazelisk and Pyenv manually, as these are not present on the x86 Mac VMs. Note that uploads from these new jobs are disabled because they are not yet ready. However, the old Mac x86 nightly builds will still run and upload to tf-nightly, so there won't be any missing nightly packages while we are doing this migration.

PiperOrigin-RevId: 587930871
jayfurmanek and others added 3 commits December 12, 2023 19:17
Conflicts:
        third_party/xla/xla/service/gpu/BUILD
        third_party/xla/xla/service/gpu/buffer_comparator_test.cc
        third_party/xla/xla/stream_executor/device_description.h
        third_party/xla/xla/stream_executor/rocm/hip_blas_lt.cc
        third_party/xla/xla/stream_executor/rocm/hip_blas_lt.h
        third_party/xla/xla/tests/BUILD
…nDef()`.

Some compilers do not like using the name of a class as a method, which is fair enough.

PiperOrigin-RevId: 588312567
@draganmladjenovic

Retest Ubuntu-GPU-single please.
Retest Ubuntu-CPU please.

@draganmladjenovic

Retest gpu-pycpp please.

@draganmladjenovic draganmladjenovic merged commit e4b2051 into develop-upstream Dec 15, 2023
9 checks passed