Merge upstream #77

Open

wants to merge 665 commits into base: amd-develop
Conversation

fsx950223

No description provided.

aakhundov and others added 30 commits April 10, 2023 10:27
Summary:
Pull Request resolved: facebookincubator#556

There is a bug in the current GEMM profiler's use of the memory pool: tensors are requested only once for the entire GEMM kernel's profiling loop. Because the same tensors / memory regions / pointers are used in every iteration of the kernel's profiling loop, the memory pool is rendered virtually useless. The risk is that small inputs may stay resident in the GPU's L2 cache, leading to unreliable profiling results.

In this diff we fix the bug by modifying the GEMM back-end profiler templates so that the `memory_pool->RequestTensorByIdx(...)` calls are made *within* the profiling loop, rotating the inputs on every call and avoiding L2 caching. Experiments with simple GEMM on small problem sizes (e.g., `M=1024, N=512, K=256`) have shown that, after the fix, the runtimes measured in profiling can grow by up to 30% for some kernels. The selected best kernel can also change as a result.
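For illustration, a minimal sketch of the loop-structure change, in Pythonized pseudocode with illustrative names (the real templates emit C++ calling `memory_pool->RequestTensorByIdx(...)`):

```
# Before: tensors requested once, so every iteration reuses the same pointers
# and small inputs can stay resident in L2, skewing the measured runtimes.
def profile_before(kernel, memory_pool, num_iters):
    a, b, c = (memory_pool.request_tensor_by_idx(i) for i in range(3))
    for _ in range(num_iters):
        kernel.run(a, b, c)

# After: tensors are requested inside the loop, rotating through the pool's
# buffers so each iteration reads from memory that is cold in L2.
def profile_after(kernel, memory_pool, num_iters):
    for _ in range(num_iters):
        a, b, c = (memory_pool.request_tensor_by_idx(i) for i in range(3))
        kernel.run(a, b, c)
```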

Reviewed By: tenpercent

Differential Revision: D44816867

fbshipit-source-id: 27259671614422cbe3072d578842b5bc617dc830
Summary: Pull Request resolved: facebookincubator#560

Reviewed By: henryhu6

Differential Revision: D44854358

Pulled By: terrychenism

fbshipit-source-id: a80e704f35aea69ba57c1b0d7bf1785312aa88bf
Summary: Pull Request resolved: facebookincubator#551

Reviewed By: tenpercent

Differential Revision: D44814768

fbshipit-source-id: 71184eeb0c95bafbd853ea4685e2135423c7df8b
…vistor (facebookincubator#552)

Summary:
Pull Request resolved: facebookincubator#552

cutlass::gemm::GemmCoord uses int values as coordinates under the hood, while AIT might pass int64_t variables to the {M, N, K} constructor, so a narrowing conversion is needed.
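A hedged sketch of the narrowing (the helper is illustrative; the actual fix performs the cast where the generated C++ constructs GemmCoord{M, N, K}):

```
def narrow_to_int32(value: int, name: str) -> int:
    # GemmCoord stores int32 coordinates; AIT shape values may exceed that.
    if not -2**31 <= value < 2**31:
        raise ValueError(f"{name}={value} does not fit in GemmCoord's int32")
    return int(value)

dims = {"M": 1024, "N": 512, "K": 256}  # example int64-backed AIT dims
m, n, k = (narrow_to_int32(v, name) for name, v in dims.items())
```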

Reviewed By: tenpercent

Differential Revision: D44814784

fbshipit-source-id: 521fb91570fea19c4a651e71ea93e2e0c787eb48
…kincubator#530)

Summary:
Pull Request resolved: facebookincubator#530

ATT.
Also updated b2b bmm kernels to support alpha1_divide_by_seq_len.

Reviewed By: aakhundov, kadeng

Differential Revision: D44451037

fbshipit-source-id: dc104bed4edff38d99d2117815d700b516a50c73
Summary:
Pull Request resolved: facebookincubator#563

The `reduce_*` ops seem to fail [this assertion](https://github.com/facebookincubator/AITemplate/blob/main/python/aitemplate/backend/cuda/reduce/reduce_small_axis.py#L316) when the last input dimension is an `IntVar`. The problem seems to be that the `_get_read_vector_type` function assumes the reduction axis is -1, even when it actually isn't. Hence the check [here](https://github.com/facebookincubator/AITemplate/blob/main/python/aitemplate/backend/cuda/reduce/reduce_small_axis.py#L413) against the actual reduction axis passes, but the subsequent aforementioned assertion fails.

This diff replaces the assertion by using the `input_type` as the `read_vector_type` when the last input dim is an `IntVar`, since the `IntVar` reduction dim's value can be odd at runtime. Instead of failing the assertion, code compilation now completes successfully.
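A minimal sketch of the fallback, with stub types standing in for AIT's IntVar/IntImm (names and vector widths are illustrative):

```
class IntVar:  # stand-in for AIT's dynamic dimension
    pass

class IntImm(IntVar):  # static dimension with a known value
    def __init__(self, value):
        self._value = value

    def value(self):
        return self._value

def get_read_vector_type(input_type: str, last_dim: IntVar) -> str:
    # Dynamic last dim: its runtime value can be odd, so vectorized loads are
    # unsafe; fall back to the plain element type instead of asserting.
    if not isinstance(last_dim, IntImm):
        return input_type
    # Static last dim: pick the widest vector type that divides it evenly.
    for width in (8, 4, 2):
        if last_dim.value() % width == 0:
            return f"{input_type}{width}"
    return input_type
```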

Reviewed By: chenyang78

Differential Revision: D44915126

fbshipit-source-id: 34a8d9b8f0b678468ed1e80f4ae56b34aafc1c5e
Summary:
Pull Request resolved: facebookincubator#541

See T148695911

With D44229622 we could show that it should be possible to speed up unit tests, and therefore also CI runs, considerably.

The task was to integrate the build cache with Sandcastle CI in order to speed up our CI process.

For reference about considered options, tradeoffs and decision process:

Original design doc at https://docs.google.com/document/d/1GHuhIJ83CsS3hgB8bV53TDTIqavqpPl4guP_kDcWdII/edit
Final design review meeting slides & notes: https://docs.google.com/presentation/d/1bICc-OtCp1kgisL3SOCN7XYN4ZRn9a6JX62eMjFUI68/edit#slide=id.g1e0053f1f88_0_53

Implementation:

 [x] Created a Manifold-based build cache implementation
 [x] Incorporated it into the non-OSS part of the codebase, in fb/build_cache.py (similar to fb/detect_model.py)
 [x] Sets TTL on stored objects and resets this TTL on read (asynchronously, no need to wait for this before continuing)
 [x] Archiving and storing of files to be cached happen asynchronously in order not to delay the tests
 [x] Investigated whether we can get Manifold latency down by creating a new bucket with different settings (did not work for me)

 Added features and config options to:

 [x] Disable caching for a compile_model call, an entire unit test, or globally (env var)
 [x] Disable the build cache for profiling only (env var)
 [ ] Not use the cache with a certain probability, in order to keep the build system and cache under test (see the sketch below)
 [x] Incorporated info from a question on the Manifold Users Workplace group about whether we can use the official Manifold client for this use case (https://fb.workplace.com/groups/ManifoldUsers/permalink/1682913152123392/). Unless we quickly get an answer, the first implementation should use the deprecated Manifold client, because that one is proven to work and safe in multiprocessing.
 [x] Does not cache .obj files (unnecessary, and takes up a large amount of storage in many cases)
 [x] Added unit test (mock Manifold client)
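A sketch of the probabilistic bypass mentioned above (the env var name is an assumption, not the real config knob):

```
import os
import random

def should_use_build_cache() -> bool:
    # With probability AIT_BUILD_CACHE_SKIP_PROB the cache is ignored, so the
    # uncached build path (and the cache itself) stays under test.
    skip_prob = float(os.environ.get("AIT_BUILD_CACHE_SKIP_PROB", "0.0"))
    return random.random() >= skip_prob
```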

Reviewed By: ipiszy, aakhundov

Differential Revision: D44642328

fbshipit-source-id: 9d2ec65e953d7f513d4325a7d1cc834f1b5afb75
…or#565)

Summary:
Pull Request resolved: facebookincubator#565

There were reports of corrupted CUTLASS include directories that led to build failures which could only be resolved by manually deleting a directory generated by the FBCUDA target below /tmp. This fix attempts to make the corresponding logic more robust against edge cases and errors, and to fail early if assertions are violated.

Reviewed By: aakhundov

Differential Revision: D44918599

fbshipit-source-id: e02e8f272ac8c625522c069a98a679383bbff883
Summary:
Pull Request resolved: facebookincubator#562

conv1d can be expressed in terms of conv2d, so no new kernel is introduced; instead, conv2d kernel generation is customized.
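The equivalence the customization relies on, demonstrated in torch (a sketch of the idea, not the generated kernel):

```
import torch
import torch.nn.functional as F

x = torch.randn(2, 8, 32)   # (batch, channels, length)
w = torch.randn(16, 8, 3)   # (out_channels, in_channels, kernel_size)

# conv1d is conv2d over a dummy height-1 spatial dimension.
y1 = F.conv1d(x, w, padding=1)
y2 = F.conv2d(x.unsqueeze(2), w.unsqueeze(2), padding=(0, 1)).squeeze(2)
assert torch.allclose(y1, y2, atol=1e-5)
```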

Reviewed By: terrychenism

Differential Revision: D44894688

fbshipit-source-id: c6e1d8894498302cf43bfe8c07ee9779b94fe3d2
Summary:
Pull Request resolved: facebookincubator#566

Refactoring the "arange" tensor used in time embeddings to be a model parameter.

Reviewed By: henryhu6

Differential Revision: D44903108

fbshipit-source-id: 227a2d4d2fee126dab02393af71ba35bef82936d
…ator#570)

Summary:
Consider the following graph:

  concat_0 = concatenate(x0, x0)
  reshape_1 = reshape(concat_0)
  concat_2 = concat(reshape_1, x1)
  concat_3 = concatenate(concat_0, x2)

Previously, our move_view_ops pass would end up in an infinite loop, because it kept turning the graph into forms that were always valid for another iteration, e.g.

  (1) after the first iteration:

  concat_0 = concatenate(x0, x0)
  concat_2 = concat(concat_0, x1)
  new_reshape = reshape(concat_2)
  concat_3 = concatenate(new_reshape, x2)

  (2) after the second iteration:

  concat_0 = concatenate(x0, x0)
  new_reshape = reshape(concat_0)
  concat_2 = concat(new_reshape, x1)
  concat_3 = concatenate(concat_0, x2)

  and so on.

  This PR fixes the issue by skipping this pattern.
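A hedged sketch of the skip condition (illustrative, using AIT's graph conventions loosely; the pass's real check may differ):

```
def should_skip_moving_view(concat_op) -> bool:
    # If the concat's output feeds both a view op and another concatenate,
    # moving the view would just recreate the mirror pattern on the next
    # iteration, so the pass leaves this pattern alone.
    dst_kinds = {
        op._attrs["op"]
        for out in concat_op._attrs["outputs"]
        for op in out.dst_ops()
    }
    return "reshape" in dst_kinds and "concatenate" in dst_kinds
```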

Pull Request resolved: facebookincubator#570

Reviewed By: hl475

Differential Revision: D44946922

Pulled By: chenyang78

fbshipit-source-id: ff91fef90218feb4679e5b073979a8de02d912a8
Summary:
Pull Request resolved: facebookincubator#516

Symbolic shape support has landed; remove the hacks that were used before.

Reviewed By: tissue3

Differential Revision: D44482705

fbshipit-source-id: 685c74efa0b4a2cec6a2f963fff4b0437b44a32e
…acebookincubator#559)

Summary:
Pull Request resolved: facebookincubator#559

`_fuse_strided_op_and_cat` pass inside `transform_strided_ops` shouldn't fuse GEMM and concat if concatenation is happening along a dimension >= rank of the original shape. This happens, for example, when GEMM output of shape `(M, N)` is unsqueezed to `(M, N, 1)` and concatenated with another `(M, N, 1)`. Such fusion would require GEMM to write the last dimension into memory in a non-contiguous way, which is not supported for row-major output (only one stride is supported).
However, fusion is possible when the unsqueezed dimension is internal, e.g. when the final shape is `(M, 1, N)`.
The method `TensorAccessor.is_rightmost_dim_contiguous` checks whether fusion is possible based on these criteria.
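A simplified sketch of the criterion (the real check is `TensorAccessor.is_rightmost_dim_contiguous`; this reduces it to the rank comparison described above):

```
def is_fusible(orig_rank: int, cat_dim: int) -> bool:
    # Concatenating along a dimension >= the original rank (e.g. the trailing
    # 1 of (M, N) unsqueezed to (M, N, 1)) would need non-contiguous writes of
    # the last dimension, which row-major GEMM output cannot express with its
    # single supported stride. An internal unsqueezed dim, as in (M, 1, N)
    # with cat_dim=1, keeps the rightmost dim contiguous and is fusible.
    return cat_dim < orig_rank
```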

Reviewed By: tissue3, aakhundov

Differential Revision: D44747795

fbshipit-source-id: 4fbb005ce27d32654bda68f8405ec06b23f17a1a
Summary:
Pull Request resolved: facebookincubator#577

It may happen that the `total_length` passed to the `padded_dense_to_jagged` op is actually a `JaggedIntVar`. In such cases, the `total_length` is fetched from the `shape[0]` of a tensor that already happens to be jagged. Before this diff, this caused an exception in the `padded_dense_to_jagged` front-end validation. The diff fixes this by fetching the `total_length` from within the passed `JaggedIntVar`.
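A sketch of the front-end fix (the `total_length()` accessor is assumed here; treat the exact API as illustrative):

```
from aitemplate.compiler.base import IntVar, JaggedIntVar

def normalize_total_length(total_length: IntVar) -> IntVar:
    # A total_length fetched from shape[0] of an already-jagged tensor is a
    # JaggedIntVar; unwrap the plain total-length IntVar it carries instead
    # of rejecting it in validation.
    if isinstance(total_length, JaggedIntVar):
        return total_length.total_length()  # accessor name is an assumption
    return total_length
```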

Reviewed By: muchulee8

Differential Revision: D44997496

fbshipit-source-id: cebc005569c66c43fcf6443547ace1332e6df050
Summary:
Pull Request resolved: facebookincubator#555

Ship the module profiling utility externally

Reviewed By: wushirong

Differential Revision: D44567322

fbshipit-source-id: 4f8ca36dbdc72dfa60e667c3592d0a2bc466b994
Summary:
Pull Request resolved: facebookincubator#575

Initial foundations are added for further support of the CUTLASS SM90 kernels. With these changes, under CUDA 12 (arch 90), the SM90 kernels will be generated, but not considered anywhere in the GEMM back-end (due to the special `GemmKind.Universal3x` not being matched against).

Reviewed By: chenyang78, tenpercent

Differential Revision: D44985884

fbshipit-source-id: 527848875f686fd582a28d7b1575734e2b1e66e6
Summary:
Pull Request resolved: facebookincubator#576

We are supposed to bypass this op, but an extreme case like
a = placeholder(); return a.to()

introduces a node in the AIT graph which has is_input=True and is_output=True, with a name like output_xx.
fx2ait throws an error when doing the input name binding. So we add an extra reshape layer here, which brings no computation.
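A hedged sketch of the workaround (the helper name is illustrative):

```
from aitemplate.compiler import ops

def materialize_output(tensor):
    # A pass-through graph would mark one node both is_input and is_output,
    # breaking fx2ait's input name binding. A reshape to the same shape gives
    # the output its own node while adding no computation.
    if tensor._attrs.get("is_input"):
        return ops.reshape()(tensor, tensor.shape())
    return tensor
```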

Reviewed By: hl475, chenyang78

Differential Revision: D44991256

fbshipit-source-id: afc951c23d205351166a0407ff4b9f218b075bff
Summary:
Pull Request resolved: facebookincubator#578

Allow the split op to take no dim as input.

Reviewed By: qxy11, Gavin-Cheng

Differential Revision: D45011263

fbshipit-source-id: 45f57064588245ad3bfe73dcbba5a92d07eb3bc3
Summary:
Pull Request resolved: facebookincubator#568

Replace infer_shape for split op with symbolic shape.

Reviewed By: frank-wei

Differential Revision: D44906484

fbshipit-source-id: 48215334ef0df7e0535d9d909a9e799362c66ec6
Summary:
Pull Request resolved: facebookincubator#579

Add Identity op in AIT.
This op provides a way to "duplicate" tensors.
Possible usages include, but are not limited to, having two names for one tensor.

The backend does a memcpy if the resulting tensor is an output. Otherwise, the implementation is only a view.
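The behavior in torch terms (a sketch of the semantics, not the backend codegen):

```
import torch

def identity(x: torch.Tensor, is_output: bool = False) -> torch.Tensor:
    # Outputs must own their memory, hence the copy (memcpy in the backend);
    # otherwise the "duplicate" is just another view of the same storage.
    return x.clone() if is_output else x.view(x.shape)
```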

Reviewed By: chenyang78

Differential Revision: D44993567

fbshipit-source-id: 37368398e68d8b527c33a9f13f141566e2a77c30
Summary:
Pull Request resolved: facebookincubator#586

Currently the build cache also caches the results of failed builds. While not necessarily wrong, the fact that build errors are not shown on repeated invocation is both irritating and a problem when attempting to pinpoint errors.

This is a minor code change which fixes that.

Reviewed By: aakhundov

Differential Revision: D45043453

fbshipit-source-id: f44fd399aa24f45178f89b667218d042cd966676
Summary:
Pull Request resolved: facebookincubator#587

Adds AIT BatchNorm1d, BatchNorm2d, and BatchNorm3d front-end modules.

Reviewed By: terrychenism

Differential Revision: D44922251

fbshipit-source-id: d2ae030fea244a1a398609209b52798d4444201c
Summary:
Pull Request resolved: facebookincubator#588

as titled

Reviewed By: qxy11, wushirong

Differential Revision: D45055690

fbshipit-source-id: 3930eec56f56bd9253c7aeac9562d8131493710f
Summary: Pull Request resolved: facebookincubator#569

Reviewed By: wfanzju, wushirong

Differential Revision: D44942651

fbshipit-source-id: af5b6c66d578d7460e297309bcc21920a632489a
Summary:
Pull Request resolved: facebookincubator#585

Now that bfloat16 and float32 are also supported, we don't need to hardcode to float16.

Reviewed By: tenpercent

Differential Revision: D45028307

fbshipit-source-id: 2fb3d2d22f2bb3761505cecf1d7f220497e529cc
Summary:
Pull Request resolved: facebookincubator#593

As titled. std in fx2ait is represented by a combination of arithmetic ops and two reduce ops:
```
Y = sqrt( mean(pow(X - mean(X), 2)) )
```

However, an issue occurs when the first reduce op uses `keepdim=False`: the subsequent sub op fails because X keeps its old dimensions while mean(X) has one dimension fewer.
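A torch reproduction of the broadcasting failure (illustrative shapes):

```
import torch

x = torch.randn(2, 3, 4)
mean_bad = x.mean(dim=1)                # shape (2, 4)
# x - mean_bad                          # RuntimeError: shapes don't broadcast
mean_ok = x.mean(dim=1, keepdim=True)   # shape (2, 1, 4)
std = (x - mean_ok).pow(2).mean(dim=1, keepdim=True).sqrt()
```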

Also, it seems there was no unittest for acc_ops when tracing std. This diff adds the test.

Reviewed By: amateurcoffee

Differential Revision: D45105942

fbshipit-source-id: 04f9c1105a2a6a711d025d5c85b95147343d0ecd
Summary:
Pull Request resolved: facebookincubator#591

A race condition in FBCUDA could be triggered when it was launched in parallel.
See https://www.internalfb.com/phabricator/paste/view/P697768916

Update:

After further investigation, the source of the race condition appears to be between multiple processes: the target_def.py CUDA.__exit__(...) code deletes things while the FBCUDA.__init__(...) code copies and writes files into directories that the other process is busy deleting. It was not obvious that CUDA.__exit__ would delete paths that FBCUDA.__init__() created.

After short discussions, we concluded that there is no way to clean this up with minimal-risk code changes: a larger rewrite of target_def.py would be necessary, also to prevent resource leaks, but that is out of scope for a hotfix.

So the safest way to go, and what is now implemented here, is the following:

 * Revert target_def.py to the pre-build-cache state, where these issues were avoided by creating new randomly named temp directories every time.
 * In order not to get new build cache keys every time, which would make caching impossible, the build cache makefile normalization was adapted instead.

The impact on build times appears negligible (approx. 0.08 seconds).

Unit tests were added and modified accordingly as well.

Reviewed By: wushirong, aakhundov

Differential Revision: D45093471

fbshipit-source-id: 9473d96ac33178dcddc98cf988fa4560c7e85f96
Summary:
Pull Request resolved: facebookincubator#584

ATT

Reviewed By: muchulee8, chenyang78, alexanderguzhva, tenpercent

Differential Revision: D45028066

fbshipit-source-id: 049ffee0f4e8f5f7e6fbd0b3517e2db7e9520c9f
henryhu6 and others added 30 commits June 26, 2023 16:34
)

Summary: Pull Request resolved: facebookincubator#792

Reviewed By: terrychenism, Yinan-Zhao

Differential Revision: D46828343

fbshipit-source-id: a190e2b39ee3fc8927560b6faf6d980c2c4b6a55
Summary:
Pull Request resolved: facebookincubator#783

As titled, this diff adds support for relational operations (e.g. ge, le, gt, lt, eq, ne). The expected behavior matches the equivalent operators in Torch, e.g. https://pytorch.org/docs/stable/generated/torch.ge.html

There are multiple constraints (see the usage sketch below):
(1) Type promotions are not supported.
(2) Broadcasting is not supported yet, so the op expects tensors of the same shape; this may be fixed in future iterations.
(3) It uses the native relational operators, i.e. >=, <, >, rather than comparison functions that support different levels of precision, e.g. https://docs.nvidia.com/cuda/cuda-math-api/group__CUDA__MATH____HALF__COMPARISON.html; this may be fixed in future iterations.
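The torch semantics being matched, for reference (same-shape inputs per constraint (2)):

```
import torch

a = torch.tensor([1.0, 2.0, 3.0])
b = torch.tensor([2.0, 2.0, 2.0])
print(torch.ge(a, b))  # tensor([False,  True,  True])
print(torch.ne(a, b))  # tensor([ True, False,  True])
```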

Reviewed By: aakhundov

Differential Revision: D46770449

fbshipit-source-id: ed58c8ff0148b389dc80f76fe6a4413f21c36a2f
…are already set (facebookincubator#799)

Summary:
Pull Request resolved: facebookincubator#799

In some cases, when we fuse a slice op with a strided op (e.g. concat), the
relevant tensor accessors in the strided op's input tensors may have already
been updated. Let's skip fusion in such a case.

Note that technically, we might be able to perform fusion in some scenarios,
but let's handle those later if we see the need.
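Purely illustrative shape of the guard (the predicate and attribute names are hypothetical, not the pass's real API):

```
def can_fuse_slice(strided_op) -> bool:
    # Skip fusion if any input accessor was already rewritten by an earlier
    # fusion; stacking a second rewrite on top is not handled yet.
    return not any(acc.is_updated for acc in strided_op.input_accessors)
```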

Reviewed By: qxy11

Differential Revision: D47028458

fbshipit-source-id: a0648c6854db9f69a83fefa698d70eb17b272f70
Summary:
Pull Request resolved: facebookincubator#791

As titled, this diff adds support for the where operation. The expected behavior matches the equivalent operator in Torch, i.e. https://pytorch.org/docs/stable/generated/torch.where.html
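The torch semantics being matched, for reference:

```
import torch

cond = torch.tensor([True, False, True])
x = torch.tensor([1, 2, 3])
y = torch.tensor([10, 20, 30])
print(torch.where(cond, x, y))  # tensor([ 1, 20,  3])
```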

Reviewed By: aakhundov

Differential Revision: D46957405

fbshipit-source-id: db4bdf4f2d91d154fb0c9ee092bf6429679b63db
…r#794)

Summary:
Pull Request resolved: facebookincubator#794

The existing tests don't use the `slice_reshape_scatter` op, so the `split_large_slice_scatter` logic is not tested.

If we remove `split_large_slice_scatter` from `optimize_graph`, the existing tests can still pass.
The newly added test cases would fail without `split_large_slice_scatter`.

Reviewed By: muchulee8

Differential Revision: D46770757

fbshipit-source-id: 69a48580df6e5108a5e03097dec2835934c718db
Summary:
Pull Request resolved: facebookincubator#800

Initialize random weights for AIT constants during model compilation to prevent identical weights from being compared when testing the accuracy of the PT module vs. the AIT module.
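A sketch of the intent in torch terms (illustrative; the diff does this for AIT constants at compile time):

```
import torch
import torch.nn as nn

pt_module = nn.Linear(4, 4)
# Non-default random constants: a broken weight binding can no longer pass
# the PT-vs-AIT accuracy check by accident.
random_state = {n: torch.randn_like(p) for n, p in pt_module.named_parameters()}
pt_module.load_state_dict(random_state)
# ... bind the same random_state to the compiled AIT module before comparing.
```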

Reviewed By: henryhu6

Differential Revision: D47031569

fbshipit-source-id: f063a8b13d3a530f7c667ce4b2259f9177bdd4fa
Summary:
Pull Request resolved: facebookincubator#803

ATT

Reviewed By: wushirong

Differential Revision: D47060963

fbshipit-source-id: 8fd0f57e8b3e0d85396a10397e8fa0a380a9cd8c
…tor#784)

Summary:
Currently the SD Alternative pipeline examples use the `demo.py` script.
I think they should use `demo_alt.py` instead.

Other minor fixes:
- made the demo_alt.py example commands multiline
- fixed the CompVis spelling and added a link to it.

Pull Request resolved: facebookincubator#784

Reviewed By: chenyang78

Differential Revision: D47039823

Pulled By: ipiszy

fbshipit-source-id: 885cbcef4a7904936da66d817d0eb62e06f5335a
…ebookincubator#801)

Summary:
Pull Request resolved: facebookincubator#801

Split slice_scatter into multiple ops if it has too many inputs. The process is very similar to splitting slice_reshape_scatter.

Added the TensorAccessor attribute to the slice_scatter op (though only its offset field will be used) to make the split logic work.
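An illustrative chunking of the inputs (the limit is an assumption, not the real threshold):

```
MAX_INPUTS_PER_OP = 128  # hypothetical per-kernel input limit

def chunk_inputs(inputs):
    # Each chunk becomes its own slice_scatter, writing at its own offset
    # (carried by the TensorAccessor) into the shared output.
    return [inputs[i:i + MAX_INPUTS_PER_OP]
            for i in range(0, len(inputs), MAX_INPUTS_PER_OP)]
```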

Reviewed By: chenyang78

Differential Revision: D46962881

fbshipit-source-id: 40457bfd5f9ec607802a70d01e7020aac0e6b3c8
Summary: Pull Request resolved: facebookincubator#777

Reviewed By: chenyang78

Differential Revision: D47039764

Pulled By: ipiszy

fbshipit-source-id: 4a2fa9228272ed32544498b68af4f4d42c02a460