forked from facebookincubator/AITemplate
Merge upstream #77
Open
fsx950223 wants to merge 665 commits into amd-develop from merge_upstream
Conversation
Summary: Pull Request resolved: facebookincubator#556 There is a bug in the current GEMM profiler's way of using the memory pool: the tensors are requested only once for the entire GEMM kernel's profiling loop. The fact that the same tensors / memory regions / pointers are used in all iterations of the kernel's profiling loop renders the memory pool virtually useless. The risk is that small inputs may stick in the GPU's L2 cache, leading to unreliable profiling results. In this diff we fix the bug by modifying the GEMM back-end profiler templates so that the `memory_pool->RequestTensorByIdx(...)` calls are made *within* the profiling loop, hence rotating the inputs on every call and avoiding L2 caching. Experiments with simple GEMM on small problem sizes (e.g., `M=1024, N=512, K=256`) have shown that, after the fix, the runtimes measured in profiling can grow by up to 30% for some of the kernels. The selected best kernel can also change as a result. Reviewed By: tenpercent Differential Revision: D44816867 fbshipit-source-id: 27259671614422cbe3072d578842b5bc617dc830
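A rough sketch of the fixed profiling loop, in Python pseudocode rather than the generated C++ profiler; `request_tensor_by_idx` is a hypothetical stand-in for the `memory_pool->RequestTensorByIdx(...)` calls mentioned above:

```python
import time

def profile_kernel(run_kernel, memory_pool, num_iters=100):
    """Conceptual only: request fresh tensors from the pool on *every*
    iteration, so the same small operands cannot stay warm in the GPU's
    L2 cache across timed runs (as they could when requested once)."""
    elapsed = 0.0
    for _ in range(num_iters):
        a = memory_pool.request_tensor_by_idx(0)  # hypothetical stand-in
        b = memory_pool.request_tensor_by_idx(1)  # for RequestTensorByIdx
        c = memory_pool.request_tensor_by_idx(2)
        start = time.perf_counter()
        run_kernel(a, b, c)
        elapsed += time.perf_counter() - start
    return elapsed / num_iters
```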
Summary: Pull Request resolved: facebookincubator#560 Reviewed By: henryhu6 Differential Revision: D44854358 Pulled By: terrychenism fbshipit-source-id: a80e704f35aea69ba57c1b0d7bf1785312aa88bf
Summary: Pull Request resolved: facebookincubator#551 Reviewed By: tenpercent Differential Revision: D44814768 fbshipit-source-id: 71184eeb0c95bafbd853ea4685e2135423c7df8b
…vistor (facebookincubator#552) Summary: Pull Request resolved: facebookincubator#552 cutlass::gemm::GemmCoord uses int values as coordinates under the hood, while AIT might use int64_t variables in the {M, N, K} constructor, so a narrowing conversion is needed. Reviewed By: tenpercent Differential Revision: D44814784 fbshipit-source-id: 521fb91570fea19c4a651e71ea93e2e0c787eb48
…kincubator#530) Summary: Pull Request resolved: facebookincubator#530 ATT. Also updated b2b bmm kernels to support alpha1_divide_by_seq_len. Reviewed By: aakhundov, kadeng Differential Revision: D44451037 fbshipit-source-id: dc104bed4edff38d99d2117815d700b516a50c73
Summary: Pull Request resolved: facebookincubator#563 The `reduce_*` ops seem to fail [this assertion](https://github.com/facebookincubator/AITemplate/blob/main/python/aitemplate/backend/cuda/reduce/reduce_small_axis.py#L316) when the last input dimension is `IntVar`. The problem seems to be that the reduction axis is assumed to be -1 in the `_get_read_vector_type` function, even if it's actually not. Hence the check [here](https://github.com/facebookincubator/AITemplate/blob/main/python/aitemplate/backend/cuda/reduce/reduce_small_axis.py#L413) against the actual reduction axis passes, but the subsequent aforementioned assertion fails. This diff replaces the assertion by using the `input_type` as the `read_vector_type` if the last input dim is `IntVar`, as the `IntVar` reduction dim's value can be odd at runtime. Instead of failing the assertion, compilation now completes successfully. Reviewed By: chenyang78 Differential Revision: D44915126 fbshipit-source-id: 34a8d9b8f0b678468ed1e80f4ae56b34aafc1c5e
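A hypothetical sketch of the described fallback (the helper name and vector-type table are illustrative, not the actual `reduce_small_axis.py` code):

```python
def choose_read_vector_type(input_type, last_dim):
    # Fall back to plain input_type reads when the innermost dimension is
    # dynamic (IntVar-like), since its runtime value may be odd; only a
    # static last dim can safely be read with a wider vector type.
    if not isinstance(last_dim, int):
        return input_type
    for width, vector_type in ((8, "uint4"), (4, "uint2"), (2, "uint")):
        if last_dim % width == 0:
            return vector_type
    return input_type

assert choose_read_vector_type("half", "seq_len") == "half"  # dynamic dim
assert choose_read_vector_type("half", 256) == "uint4"       # static dim
```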
Summary: Pull Request resolved: facebookincubator#541 See T148695911. With D44229622 we could show that it should be possible to speed up unit tests, and therefore also CI runs, considerably. The task was to integrate the build cache with Sandcastle CI in order to speed up our CI process. For reference on the considered options, tradeoffs, and decision process: original design doc at https://docs.google.com/document/d/1GHuhIJ83CsS3hgB8bV53TDTIqavqpPl4guP_kDcWdII/edit and final design review meeting slides & notes at https://docs.google.com/presentation/d/1bICc-OtCp1kgisL3SOCN7XYN4ZRn9a6JX62eMjFUI68/edit#slide=id.g1e0053f1f88_0_53
Implementation:
- [x] Created a Manifold-based build cache implementation
- [x] Incorporated it into the non-OSS part of the codebase, similar to fb/detect_model.py, in fb/build_cache.py
- [x] Sets TTL on stored objects and resets this TTL on read (asynchronously, no need to wait for this before continuing)
- [x] Archiving and storing of files to be cached happen asynchronously in order not to delay the tests
- [x] Investigated whether we can get Manifold latency down by creating a new bucket with different settings (did not work for me)
Added features and config options to:
- [x] Disable caching for a compile_model call, an entire unit test, or globally (env var)
- [x] Disable the build cache for profiling only (env var)
- Not use the cache with a certain probability (in order to keep the build system and cache under test)
- [x] Incorporated info from a question on the Manifold Users Workplace group on whether we can use the official Manifold client for this use case (https://fb.workplace.com/groups/ManifoldUsers/permalink/1682913152123392/). (Unless we quickly get an answer, the first implementation should use the deprecated Manifold client, because that is proven to work and safe in multiprocessing.)
- [x] Does not cache .obj files (unnecessary, and takes up a large amount of storage in many cases)
- [x] Added a unit test (mock Manifold client)
Reviewed By: ipiszy, aakhundov Differential Revision: D44642328 fbshipit-source-id: 9d2ec65e953d7f513d4325a7d1cc834f1b5afb75
…or#565) Summary: Pull Request resolved: facebookincubator#565 There were reports of corrupted CUTLASS include directories which led to build failures that could only be resolved by manually deleting a directory generated by the FBCUDA target below /tmp. This fix attempts to make the corresponding logic more robust against edge cases and errors, as well as to fail early if assertions are violated. Reviewed By: aakhundov Differential Revision: D44918599 fbshipit-source-id: e02e8f272ac8c625522c069a98a679383bbff883
Summary: Pull Request resolved: facebookincubator#562 conv1d can be expressed in terms of conv2d, so I didn't introduce any new kernel but instead customized the conv2d kernel generation. Reviewed By: terrychenism Differential Revision: D44894688 fbshipit-source-id: c6e1d8894498302cf43bfe8c07ee9779b94fe3d2
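The equivalence the commit relies on, illustrated with PyTorch reference ops (not the AIT codegen itself): a 1-D convolution is a 2-D convolution over an inserted singleton spatial dimension.

```python
import torch
import torch.nn.functional as F

x = torch.randn(2, 8, 32)   # (N, C_in, L)
w = torch.randn(16, 8, 3)   # (C_out, C_in, K)

y_1d = F.conv1d(x, w, stride=1, padding=1)
y_2d = F.conv2d(x.unsqueeze(2),             # (N, C_in, 1, L)
                w.unsqueeze(2),             # (C_out, C_in, 1, K)
                stride=(1, 1),
                padding=(0, 1)).squeeze(2)  # back to (N, C_out, L)

assert torch.allclose(y_1d, y_2d, atol=1e-5)
```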
Summary: Pull Request resolved: facebookincubator#566 Refactoring the "arange" tensor used in time embeddings to be a model parameter. Reviewed By: henryhu6 Differential Revision: D44903108 fbshipit-source-id: 227a2d4d2fee126dab02393af71ba35bef82936d
…ator#570) Summary: Consider we have the following graph:
concat_0 = concatenate(x0, x0)
reshape_1 = reshape(concat_0)
concat_2 = concat(reshape_1, x1)
concat_3 = concatenate(concat_0, x2)
Previously, our move_view_ops pass would end up in an infinite loop, because it kept turning the graph into forms that were always valid for another iteration, e.g. (1) after the first iteration:
concat_0 = concatenate(x0, x0)
concat_2 = concat(concat_0, x1)
new_reshape = reshape(concat_2)
concat_3 = concatenate(new_reshape, x2)
(2) after the second iteration:
concat_0 = concatenate(x0, x0)
new_reshape = reshape(concat_0)
concat_2 = concat(new_reshape, x1)
concat_3 = concatenate(concat_0, x2)
and so on. This PR fixes the issue by skipping the pattern. Pull Request resolved: facebookincubator#570 Reviewed By: hl475 Differential Revision: D44946922 Pulled By: chenyang78 fbshipit-source-id: ff91fef90218feb4679e5b073979a8de02d912a8
Summary: Pull Request resolved: facebookincubator#516 Symbolic shape support has landed, remove hacks that were used. Reviewed By: tissue3 Differential Revision: D44482705 fbshipit-source-id: 685c74efa0b4a2cec6a2f963fff4b0437b44a32e
…acebookincubator#559) Summary: Pull Request resolved: facebookincubator#559 The `_fuse_strided_op_and_cat` pass inside `transform_strided_ops` shouldn't fuse GEMM and concat if concatenation happens along a dimension >= rank of the original shape. This happens, for example, when a GEMM output of shape `(M, N)` is unsqueezed to `(M, N, 1)` and concatenated with another `(M, N, 1)`. Such fusion would require the GEMM to write the last dimension into memory in a non-contiguous way, which is not supported for row-major output (only one stride is supported). However, fusion is possible when the unsqueezed dimension is internal, e.g. when the final shape is `(M, 1, N)`. The method `TensorAccessor.is_rightmost_dim_contiguous` checks whether fusion is possible based on these criteria. Reviewed By: tissue3, aakhundov Differential Revision: D44747795 fbshipit-source-id: 4fbb005ce27d32654bda68f8405ec06b23f17a1a
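An illustrative restatement of the criterion (not the actual `TensorAccessor.is_rightmost_dim_contiguous` implementation):

```python
def can_fuse_gemm_and_concat(gemm_output_rank, concat_dim):
    # Fusion is only legal when the concat dimension lies within the rank of
    # the original GEMM output, so the rightmost (contiguous) dimension can
    # still be written with a single stride by the row-major epilogue.
    return concat_dim < gemm_output_rank

# (M, N) unsqueezed to (M, 1, N), concatenated along dim 1: fusible.
assert can_fuse_gemm_and_concat(gemm_output_rank=2, concat_dim=1)
# (M, N) unsqueezed to (M, N, 1), concatenated along dim 2: not fusible.
assert not can_fuse_gemm_and_concat(gemm_output_rank=2, concat_dim=2)
```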
Summary: Pull Request resolved: facebookincubator#577 It may happen that `total_length` passed to the `padded_dense_to_jagged` op is actually a `JaggedIntVar`. In such cases, the `total_length` is fetched from the `shape[0]` of a tensor that already happens to be jagged. Before this diff, this has caused an exception in the `padded_dense_to_jagged` front-end validation. The diff fixes this by fetching the `total_length` from within the passed `JaggedIntVar`. Reviewed By: muchulee8 Differential Revision: D44997496 fbshipit-source-id: cebc005569c66c43fcf6443547ace1332e6df050
Summary: Pull Request resolved: facebookincubator#555 Ship the module profiling utility externally Reviewed By: wushirong Differential Revision: D44567322 fbshipit-source-id: 4f8ca36dbdc72dfa60e667c3592d0a2bc466b994
Summary: Pull Request resolved: facebookincubator#575 Initial foundations are added for further support of the CUTLASS SM90 kernels. With these changes, under CUDA 12 (arch 90), the SM90 kernels will be generated, but not considered anywhere in the GEMM back-end (due to the special `GemmKind.Universal3x` not being matched against). Reviewed By: chenyang78, tenpercent Differential Revision: D44985884 fbshipit-source-id: 527848875f686fd582a28d7b1575734e2b1e66e6
Summary: Pull Request resolved: facebookincubator#576 We are supposed to bypass this op, but in an extreme case like `a = placeholder(); return a.to()` it introduces a node in the AIT graph that has is_input=True and is_output=True. The node name is output_xx, and fx2ait throws an error when doing the input name binding. So we add an extra reshape layer here, which brings no computation. Reviewed By: hl475, chenyang78 Differential Revision: D44991256 fbshipit-source-id: afc951c23d205351166a0407ff4b9f218b075bff
Summary: Pull Request resolved: facebookincubator#578 To allow split op with no dim as input Reviewed By: qxy11, Gavin-Cheng Differential Revision: D45011263 fbshipit-source-id: 45f57064588245ad3bfe73dcbba5a92d07eb3bc3
Summary: Pull Request resolved: facebookincubator#568 Replace infer_shape for split op with symbolic shape. Reviewed By: frank-wei Differential Revision: D44906484 fbshipit-source-id: 48215334ef0df7e0535d9d909a9e799362c66ec6
Summary: Pull Request resolved: facebookincubator#579 Add an Identity op in AIT. This op provides a way to "duplicate" tensors. Possible uses include, but are not limited to, having two names for one tensor. The backend does a memcpy if the resulting tensor is an output; otherwise, the implementation is only a view. Reviewed By: chenyang78 Differential Revision: D44993567 fbshipit-source-id: 37368398e68d8b527c33a9f13f141566e2a77c30
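A minimal sketch of the described semantics using NumPy (illustrative only, not the AIT backend or its front-end API):

```python
import numpy as np

def identity(tensor, is_output=False):
    # The op mainly gives an existing tensor a second name: only when the
    # result is a model output is a real copy (memcpy) materialized;
    # otherwise the result is just a view of the source buffer.
    return np.copy(tensor) if is_output else tensor.view()

x = np.zeros((4, 8), dtype=np.float16)
alias = identity(x)                   # shares memory with x
output = identity(x, is_output=True)  # independent buffer
assert alias.base is x and output.base is None
```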
Summary: Pull Request resolved: facebookincubator#586 Currently the build cache also caches the results of failed builds. While not necessarily wrong, the fact that build errors are not shown on repeated invocation is both irritating and a problem when attempting to pinpoint errors. This is a minor code change which fixes that. Reviewed By: aakhundov Differential Revision: D45043453 fbshipit-source-id: f44fd399aa24f45178f89b667218d042cd966676
Summary: Pull Request resolved: facebookincubator#587 Adds AIT BatchNorm1d, BatchNorm2d, and BatchNorm3d FE module Reviewed By: terrychenism Differential Revision: D44922251 fbshipit-source-id: d2ae030fea244a1a398609209b52798d4444201c
Summary: Pull Request resolved: facebookincubator#588 as titled Reviewed By: qxy11, wushirong Differential Revision: D45055690 fbshipit-source-id: 3930eec56f56bd9253c7aeac9562d8131493710f
Summary: Pull Request resolved: facebookincubator#569 Reviewed By: wfanzju, wushirong Differential Revision: D44942651 fbshipit-source-id: af5b6c66d578d7460e297309bcc21920a632489a
Summary: Pull Request resolved: facebookincubator#585 Now that bfloat16 and float32 are also supported, we don't need to hardcode to float16. Reviewed By: tenpercent Differential Revision: D45028307 fbshipit-source-id: 2fb3d2d22f2bb3761505cecf1d7f220497e529cc
Summary: Pull Request resolved: facebookincubator#593 As titled. std in fx2ait is represented by a combination of arithmetic ops and two reduce ops: ``` Y = sqrt( mean(pow(X - mean(X), 2)) ) ``` However, an issue occurs when the first reduce op uses `keepdim=False`. It causes a problem in the subsequent sub op, where X keeps its old dimensions while mean(X) has one dimension fewer. Also, it seems there was no unit test for acc_ops when tracing std; this diff adds the test. Reviewed By: amateurcoffee Differential Revision: D45105942 fbshipit-source-id: 04f9c1105a2a6a711d025d5c85b95147343d0ecd
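The decomposition and the keepdim pitfall, shown with the PyTorch reference (a sketch of the behavior being traced, not the fx2ait code):

```python
import torch

x = torch.randn(4, 8)

# keepdim=True keeps the reduced axis as size 1, so mean(X) broadcasts
# cleanly against X in the subtraction of the decomposition above:
mu = x.mean(dim=1, keepdim=True)            # shape (4, 1)
std = ((x - mu) ** 2).mean(dim=1).sqrt()    # sqrt(mean(pow(X - mean(X), 2)))
assert torch.allclose(std, x.std(dim=1, unbiased=False))

# With keepdim=False, mean(X) has shape (4,), which no longer broadcasts
# against the (4, 8) input -- the dimension mismatch the diff addresses.
```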
Summary: Pull Request resolved: facebookincubator#591 A race condition in FBCUDA could be triggered when it was launched in parallel. See https://www.internalfb.com/phabricator/paste/view/P697768916 Update: After further investigation, the source of the race condition appears to be between multiple processes, with the target_def.py CUDA.__exit__(...) code deleting things while the FBCUDA.__init__(...) code of another process is copying and writing files into the directories being deleted. It was not obvious that CUDA.__exit__ would delete paths that FBCUDA.__init__() created. After short discussions, we concluded that there is no way to clean this up with minimal-risk code changes; a larger rewrite of target_def.py would be necessary, also to prevent resource leaks, but that is out of scope for a hotfix. So the safest way to go, and what is now implemented here, is the following:
* Revert target_def.py to the pre-build-cache state, where these issues were avoided by creating new randomly named temp directories every time.
* In order not to get new build cache keys every time, which would make caching impossible, the build cache makefile normalization was adapted instead.
The impact on build times appears negligible (approx. 0.08 seconds). Unit tests were added and modified accordingly as well. Reviewed By: wushirong, aakhundov Differential Revision: D45093471 fbshipit-source-id: 9473d96ac33178dcddc98cf988fa4560c7e85f96
Summary: Pull Request resolved: facebookincubator#584 ATT Reviewed By: muchulee8, chenyang78, alexanderguzhva, tenpercent Differential Revision: D45028066 fbshipit-source-id: 049ffee0f4e8f5f7e6fbd0b3517e2db7e9520c9f
) Summary: Pull Request resolved: facebookincubator#792 Reviewed By: terrychenism, Yinan-Zhao Differential Revision: D46828343 fbshipit-source-id: a190e2b39ee3fc8927560b6faf6d980c2c4b6a55
Summary: Pull Request resolved: facebookincubator#783 As titled, this diff adds support for relational operations (e.g. ge, le, gt, lt, eq, ne). The expected behavior is to match the equivalent operators in Torch, e.g. https://pytorch.org/docs/stable/generated/torch.ge.html There are multiple constraints:
(1) Type promotion is not supported.
(2) Broadcast is not supported yet, so the op expects tensors of the same shape; this may be fixed in future iterations.
(3) It uses the relational operators, i.e. >=, <, >, rather than comparison functions that support different levels of precision, e.g. https://docs.nvidia.com/cuda/cuda-math-api/group__CUDA__MATH____HALF__COMPARISON.html; this may be fixed in future iterations.
Reviewed By: aakhundov Differential Revision: D46770449 fbshipit-source-id: ed58c8ff0148b389dc80f76fe6a4413f21c36a2f
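For reference, the Torch behavior the new ops mirror, on same-shape tensors (consistent with the constraints above):

```python
import torch

a = torch.tensor([1.0, 2.0, 3.0])
b = torch.tensor([3.0, 2.0, 1.0])   # same shape and dtype: no broadcast,
                                    # no type promotion involved

print(torch.ge(a, b))  # tensor([False,  True,  True])
print(torch.lt(a, b))  # tensor([ True, False, False])
print(torch.eq(a, b))  # tensor([False,  True, False])
```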
…are already set (facebookincubator#799) Summary: Pull Request resolved: facebookincubator#799 In some cases, when we fuse a slice op with a strided op (e.g. concat), the relevant tensor accessors in the strided op's input tensors may have already been updated. Let's skip fusion in such a case. Note that technically, we might be able to perform fusion in some scenarios, but let's handle those later if we see the need. Reviewed By: qxy11 Differential Revision: D47028458 fbshipit-source-id: a0648c6854db9f69a83fefa698d70eb17b272f70
Summary: Pull Request resolved: facebookincubator#791 As titled, this diff adds support for the where operation. The expected behavior is to match the equivalent operator in Torch, i.e. https://pytorch.org/docs/stable/generated/torch.where.html Reviewed By: aakhundov Differential Revision: D46957405 fbshipit-source-id: db4bdf4f2d91d154fb0c9ee092bf6429679b63db
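For reference, the Torch behavior being matched:

```python
import torch

cond = torch.tensor([True, False, True])
x = torch.tensor([1.0, 2.0, 3.0])
y = torch.tensor([10.0, 20.0, 30.0])

# Elementwise select: take from x where cond is True, otherwise from y.
print(torch.where(cond, x, y))  # tensor([ 1., 20.,  3.])
```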
…r#794) Summary: Pull Request resolved: facebookincubator#794 The existing tests don't use the `slice_reshape_scatter` op, so the `split_large_slice_scatter` logic is not tested. If we remove `split_large_slice_scatter` from `optimize_graph`, the existing tests can still pass. The newly added test cases would fail without `split_large_slice_scatter`. Reviewed By: muchulee8 Differential Revision: D46770757 fbshipit-source-id: 69a48580df6e5108a5e03097dec2835934c718db
Summary: Pull Request resolved: facebookincubator#800 Initialize random weights for AIT constants during model compilation to prevent identical weights being compared when testing accuracy of PT module vs AIT module. Reviewed By: henryhu6 Differential Revision: D47031569 fbshipit-source-id: f063a8b13d3a530f7c667ce4b2259f9177bdd4fa
Summary: Pull Request resolved: facebookincubator#803 ATT Reviewed By: wushirong Differential Revision: D47060963 fbshipit-source-id: 8fd0f57e8b3e0d85396a10397e8fa0a380a9cd8c
…tor#784) Summary: Currently the SD Alternative pipeline examples use the `demo.py` script. I think they should use `demo_alt.py` instead. Other minor fixes:
- made the demo_alt.py example commands multiline
- fixed the CompVis spelling and added an href to it.
Pull Request resolved: facebookincubator#784 Reviewed By: chenyang78 Differential Revision: D47039823 Pulled By: ipiszy fbshipit-source-id: 885cbcef4a7904936da66d817d0eb62e06f5335a
…ebookincubator#801) Summary: Pull Request resolved: facebookincubator#801 Split slice_scatter into multiple ops if it has too many inputs. The process is very similar to splitting slice_reshape_scatter. Added the TensorAccessor attribute to the slice_scatter op (but it will only use its offset field) to make the split logic work. Reviewed By: chenyang78 Differential Revision: D46962881 fbshipit-source-id: 40457bfd5f9ec607802a70d01e7020aac0e6b3c8
Summary: Pull Request resolved: facebookincubator#777 Reviewed By: chenyang78 Differential Revision: D47039764 Pulled By: ipiszy fbshipit-source-id: 4a2fa9228272ed32544498b68af4f4d42c02a460