forked from facebookincubator/AITemplate
Merge upstream ci #66
Open
fsx950223 wants to merge 468 commits into amd-develop from merge_upstream_ci
Conversation
Summary: Add the `AIT_TIME_COMPILATION` description to [env.rst](https://facebookincubator.github.io/AITemplate/reference/env.html). Follow-up change for facebookincubator#347. Pull Request resolved: facebookincubator#356 Reviewed By: alexanderguzhva Differential Revision: D43752096 Pulled By: tenpercent fbshipit-source-id: f521248d661b87ce82d954a40a17a333ffbcc5b2
Summary: Fixes:
- fix typo `"output1": out0_ait` -> `"output1": out1_ait`
- the `outputs` array is created based on `len(input_name_to_idx)`, but it should use `len(output_name_to_idx)` instead
- simpler code to create a fixed-size array
Pull Request resolved: facebookincubator#352 Reviewed By: tenpercent, houseroad Differential Revision: D43756996 Pulled By: muchulee8 fbshipit-source-id: 9e8dcea10d9a22c161a2a497823725d653e67350
Summary: Pull Request resolved: facebookincubator#358 as titled Reviewed By: chenyang78 Differential Revision: D43762214 fbshipit-source-id: 4f5457ac90f53e98ffcf4bdf3b0b88459c8f2d4b
…cebookincubator#360) Summary: In some constrained dev environments, installing pycuda is not feasible. This PR enables us to first check whether pycuda is available. If not, we fall back to the old approach to detecting the target. Pull Request resolved: facebookincubator#360 Reviewed By: tenpercent Differential Revision: D43775402 Pulled By: chenyang78 fbshipit-source-id: d9ab747110e41cba8f993ede83a6794db34bfb66
Summary: Pull Request resolved: facebookincubator#312

### Implement expand operator CUDA backend

Adding a CUDA backend implementation for expand: https://fburl.com/code/nb2mcsmg. The operator semantics should be the same as the PyTorch version: https://fburl.com/fljywh6p.

#### Implementation

The previous expand operator was a no-op version, which only worked under very limited conditions, namely when it expanded just a single, already existing dimension and could be merged into a following elementwise op that supports tensor broadcasting. This new version actually expands the tensor, supporting multiple expansion dimensions, dynamic shapes, and adding dimensions, just like the PyTorch version. Three CUDA kernels are implemented: one dealing with the general case, and two specialized to be faster in certain scenarios. The PyTorch version is nevertheless more efficient in principle, because it just creates a new view on the source tensor with different read strides. As AIT has no general notion of strides for tensor dimensions, this is not a real option at the moment, unless we add that support to tensors and the operators on them.

#### Further possible optimizations (not part of this PR)

* When adding leading dimensions, the work can be decomposed into writing an upper part of the tensor (requiring strided reads or writes) and then repeatedly copying that tensor (which can be done with efficient sequential reads and writes and can utilize shared memory)
* Further operator fusions should be possible
* With all-immutable dimensions, a more efficient implementation would be possible via loop unrolling, precalculation of strides, etc.

Reviewed By: chenyang78 Differential Revision: D43419041 fbshipit-source-id: 84ec2c4716c3e21860d1d55807cf649ed543ba2e
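The PyTorch-style semantics described here can be modeled in a few lines of numpy (an illustrative sketch, not AIT code):

```python
import numpy as np

def expand(x: np.ndarray, *sizes: int) -> np.ndarray:
    # Size-1 dims are repeated, -1 keeps an existing dim, and leading
    # dims may be added. broadcast_to models the stride-0 read pattern;
    # ascontiguousarray models AIT actually materializing the copy.
    lead = len(sizes) - x.ndim
    assert lead >= 0, "expand cannot drop dimensions"
    assert all(s != -1 for s in sizes[:lead]), "-1 only valid for existing dims"
    target = [x.shape[i - lead] if s == -1 else s for i, s in enumerate(sizes)]
    return np.ascontiguousarray(np.broadcast_to(x, target))

x = np.arange(3).reshape(3, 1)
y = expand(x, 2, -1, 4)  # add a leading dim, keep dim 3, expand the size-1 dim
assert y.shape == (2, 3, 4)
```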
Summary: This PR enables bmm_ccr/bmm_rrr and concat fusion. It also cleans up the relevant unit tests a bit. Pull Request resolved: facebookincubator#359 Reviewed By: tenpercent Differential Revision: D43775333 Pulled By: chenyang78 fbshipit-source-id: 7ce94b00066f7f5142388eee397d6959cde183e0
Summary: Pull Request resolved: facebookincubator#362 Reviewed By: alexanderguzhva Differential Revision: D43813375 Pulled By: chenyang78 fbshipit-source-id: d9c65bf2b15e6362343b6d4e77a510853fad5613
Summary: This PR enables dynamic h/w for conv2d and d/h/w for conv3d. The profiling strategy is not optimal, as we only profile with the max d/h/w values. We will implement a better strategy (e.g., bucketing) later. We also removed duplicate code in conv3d_bias. Pull Request resolved: facebookincubator#363 Reviewed By: terrychenism Differential Revision: D43821796 Pulled By: chenyang78 fbshipit-source-id: 8f91b9193becf1727b704573a9bdca5a036d8b8d
Summary: Pull Request resolved: facebookincubator#357 If there is more than one most-frequent dimension in the input shapes, the leftmost one, i.e. the one with the lowest position score (the sum of its position indices across the shapes), is picked as the batch size. If there are multiple most-frequent dimensions with the same position score, the choice is still arbitrary. Reviewed By: wushirong Differential Revision: D43755669 fbshipit-source-id: a8c10bbd2977e953ce44a22b0ee2df8e7c976963
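A minimal sketch of this heuristic, assuming it operates on plain shape tuples (the helper below is hypothetical, not AIT's actual code):

```python
from collections import Counter

def infer_batch_dim(shapes):
    # Count how often each dimension value occurs across all shapes.
    freq = Counter(dim for shape in shapes for dim in shape)
    top_count = max(freq.values())
    candidates = [dim for dim, count in freq.items() if count == top_count]

    # Position score: sum of the dim's indices within the shapes; the
    # lowest score corresponds to the leftmost occurrence overall.
    def position_score(dim):
        return sum(i for shape in shapes for i, d in enumerate(shape) if d == dim)

    return min(candidates, key=position_score)

# 8 and 16 are equally frequent, but 8 sits further left (score 0 vs. 2).
assert infer_batch_dim([(8, 16), (8, 16)]) == 8
```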
Summary: Pull Request resolved: facebookincubator#366 An int cannot be compared with NoneType, so we need to check for None before comparing. Reviewed By: aakhundov Differential Revision: D43815532 fbshipit-source-id: 384561e43bd51007b6c93530e5087a110758df12
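The pattern applied is the usual Python 3 guard, sketched here (function name is illustrative):

```python
# In Python 3, `1 < None` raises TypeError, so check for None first.
def within_limit(value: int, limit) -> bool:
    return limit is None or value <= limit
```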
Summary: Pull Request resolved: facebookincubator#364 Previously, `make_jagged`'s back-end relied on whether the `batch_dim` is present in any Tensor's `_attrs["shape"]` to decide if the `batch_dim` must be set (to `offsets.lengths[0] - 1`) or validated (to be equal to that). This is problematic for cases where the `batch_dim` is present in a Tensor shape in the downstream graph, and hence is supposed to be set by `make_jagged` instead of being validated. One such case arises in the `jagged_to_dense` op, where the output dense Tensor's first `batch_dim` dimension is not known to the runtime until the input jagged Tensor is "unwrapped". In this case, `make_jagged` must assign the `batch_dim` present inside the jagged Tensor's `JaggedIntVar`, instead of validating it, so that it has a value by the time the output dense Tensor with the `batch_dim` in its `_attrs["shape"]` is processed further. To mitigate this, this diff changes `make_jagged`'s condition to set vs. validate the `batch_dim` to whether the `batch_dim` equals zero at runtime. Being equal to zero means that the `batch_dim` has not yet been initialized (dynamic dimensions in the runtime are set to zero on declaration), which, in turn, means it must be set. If the `batch_dim` is not equal to zero, it has already been set, hence must be validated. Reviewed By: ipiszy Differential Revision: D43712183 fbshipit-source-id: a4570729bf6ebfba21034330b68c0362f685c72c
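The set-vs-validate decision can be sketched as follows (names and signature are hypothetical; the real logic lives in the generated runtime code):

```python
def set_or_validate_batch_dim(batch_dim_value: int, offsets_length: int) -> int:
    expected = offsets_length - 1   # batch size implied by the offsets array
    if batch_dim_value == 0:
        return expected             # zero-initialized => not set yet => set it
    if batch_dim_value != expected:
        raise ValueError(f"batch_dim {batch_dim_value} != expected {expected}")
    return batch_dim_value          # already set => validate it
```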
Summary: Pull Request resolved: facebookincubator#371 Reviewed By: wushirong Differential Revision: D43859425 fbshipit-source-id: ed4e568d44a81c52769bf0c43ec775c4ddc88503
Reviewed By: frank-wei Differential Revision: D43677477 fbshipit-source-id: b916f43bc7170de8bfcfb5468941d7dc82f26524
Summary: Pull Request resolved: facebookincubator#377 Dropout is a no-op at inference, so it is removed with the acc tracer. Reviewed By: frank-wei, wushirong Differential Revision: D43881227 fbshipit-source-id: 0246365e6facc6dfb13843fa9854802f35c0938a
Summary: If the target resolution is not divisible by 64, the compilation process fails at the UNet step. This PR asserts the width and height immediately, instead of compiling the CLIP model and failing a few minutes in. closes facebookincubator#345 Pull Request resolved: facebookincubator#355 Reviewed By: tenpercent Differential Revision: D43784017 Pulled By: muchulee8 fbshipit-source-id: 7ab7581f80c4e649e1afa4a22b53da3aac959c13
Summary: Updated the softmax wiki link and added images so that the wiki has links to refer to. Pull Request resolved: facebookincubator#379 Reviewed By: muchulee8 Differential Revision: D43890078 Pulled By: tissue3 fbshipit-source-id: 5893e904c14b684b16fe8601419cba74bf0d50d7
Summary: Pull Request resolved: facebookincubator#372 Reviewed By: jiaqizhai, khabinov Differential Revision: D43726142 fbshipit-source-id: 90add0c73e9b7725a4a0969fd3ba14ae81d3e481
Summary: Pull Request resolved: facebookincubator#374 A fallback for the half2 input-type version of fast_tanh is implemented for the CUDA_ARCH < 75 case. Reviewed By: aakhundov Differential Revision: D43871666 fbshipit-source-id: 5e9bed21996eb9cd5e71fdb3851e7ab9d20826cb
Summary: Pull Request resolved: facebookincubator#370 It turns out `JaggedIntVar` wasn't hashable. This created problems for some passes (e.g., [here](https://github.com/facebookincubator/AITemplate/blob/75f54510d8e02114e013200a66ea9a5d433e5f81/python/aitemplate/compiler/transform/transform_strided_op_and_view_op.py#L44-L48)). This diff adds a `__hash__` function to `JaggedIntVar`. And because it should pretend to be a regular `IntVar` by default, the new `__hash__` function has the structure of `IntVar.__hash__`. Reviewed By: ipiszy Differential Revision: D43857198 fbshipit-source-id: dc569e02731ae07aa522ad06d45d4b2f8893d336
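The idea can be sketched like this (the attribute layout is assumed for illustration and simplified from the real classes):

```python
class IntVar:
    def __init__(self, values, name=None):
        self._attrs = {"name": name, "values": list(values)}

    def __hash__(self):
        return hash((self._attrs["name"], tuple(self._attrs["values"])))


class JaggedIntVar(IntVar):
    # Reuse the parent's hash structure instead of Python's default
    # identity hash, so a JaggedIntVar can stand in for a plain IntVar
    # in sets and dict keys used by graph passes.
    __hash__ = IntVar.__hash__
```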
Summary: Pull Request resolved: facebookincubator#380 In this diff, the `jagged_to_dense` front-end and back-end op is added with vectorized I/O. We reuse many of the utilities in `testing/jagged_utils.py`, similar to `backend/common/elementwise_common.py` in D43482363. A unit test and a benchmark are included.

## Implementation Details

Since the output is dense, we base the calculations on the dense shape and apply padding whenever the current element falls outside the input's jagged shape. Reviewed By: aakhundov Differential Revision: D43562375 fbshipit-source-id: 930ad6793a9c6260497847330abd0a83e5861ac9
Summary: Pull Request resolved: facebookincubator#368 Cleaned up code for the expand op:
* Added more documentation comments & type hints
* Improved variable & function naming
* Simplified code (eliminated potentially unnecessarily specialized kernels)
Reviewed By: chenyang78 Differential Revision: D43844913 fbshipit-source-id: 3734e1b47d108398d5e1513e301a193e54839dc9
Summary: Pull Request resolved: facebookincubator#354 Applied a one-off refactoring script to change all relative imports within AITemplate to absolute imports. Then ran arc lint to make sure formatting is correct.

### Why?

IDEs like VSCode or PyCharm have problems resolving the paths to packages imported via relative imports, as they don't know the basepath. Now we can navigate to all imported symbols using CMD+click on the symbol.

Here is the script. It is intended for one-off usage, so I did not bother with code style or reusability.

```python
import os
import re
from pathlib import Path


def process_relative_imports(path, basepath, basepackage):
    path = os.path.abspath(str(path))
    basepath = os.path.abspath(str(basepath))
    if not path.startswith(basepath):
        return
    relpath = path[len(basepath):].strip("/")
    pparts = relpath.split("/")

    def dot_replacer(match):
        dots = match.group(2)
        pkg = basepackage + ".".join(pparts[: -len(dots)])
        pkg = pkg.strip(".")
        replacement = match.group(1) + pkg + "." + match.group(3)
        return replacement.replace("..", ".")

    with open(path, "rt", encoding="utf8") as f:
        contents = f.read()
    rcontents = re.sub(
        r"(^from )(\.+)([^\.].*import .*$)",
        dot_replacer,
        contents,
        flags=re.MULTILINE,
    )
    with open(path, "wt", encoding="utf8") as f:
        f.write(rcontents)
        print(f"Wrote {path}")


if __name__ == "__main__":
    allpyfiles = [str(path) for path in Path(".").rglob("*.py")]
    for p in allpyfiles:
        print(p)
        if p.endswith("extra_cutlass_generator.py"):
            continue
        process_relative_imports(p, ".", "aitemplate.")
```

Reviewed By: ipiszy, chenyang78, tenpercent Differential Revision: D43715713 fbshipit-source-id: 1c2eaaaadc2f1edf8f4e378bc2781c5f851e80ba
Summary: Pull Request resolved: facebookincubator#382 ATT Reviewed By: alexanderguzhva Differential Revision: D43920370 fbshipit-source-id: b387815948bff5b8791069c37683df8f3ff7273b
…tor#383) Summary: Pull Request resolved: facebookincubator#383 ATT Reviewed By: wushirong Differential Revision: D43923044 fbshipit-source-id: 77a21ddf9a1a11180f9bde2b132dca43964e2a88
…00 to V100 (facebookincubator#384) Summary: Pull Request resolved: facebookincubator#384 ATT Reviewed By: wushirong Differential Revision: D43924250 fbshipit-source-id: 7b438ccc420d99352855b0e69088184df075afe2
…#381) Summary: Pull Request resolved: facebookincubator#381 Setting the AIT_PLOT_SHORTEN_TENSOR_NAMES=1 environment variable makes AITemplate replace long tensor names with shortened ones (like a URL shortener does) when building a plot. Reviewed By: chenyang78 Differential Revision: D43918759 fbshipit-source-id: d820dfae8cbfdd5c9e0ac750709736a17a94ceeb
…graph in a third-party python code. (facebookincubator#388) Summary: Pull Request resolved: facebookincubator#388 `_graph.json` files will be generated in addition to `_graph.txt` files under the same circumstances. Such a file can be loaded with a `json.loads()` call. Reviewed By: chenyang78 Differential Revision: D43951586 fbshipit-source-id: 392ee5b43f4746f428a1d92ba2bcc5ab4cbf11bb
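For instance, loading such a dump from third-party Python code could look like this (the file name is hypothetical):

```python
import json

# Read the JSON twin of the `_graph.txt` dump and parse it into
# Python objects; the file name here is only an example.
with open("my_model_graph.json", encoding="utf8") as f:
    graph = json.loads(f.read())
```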
Summary: Pull Request resolved: facebookincubator#389 att Reviewed By: chenyang78 Differential Revision: D43953467 fbshipit-source-id: 61dc27f91210bdf0984f6b0ba3645bd9daeed819
Summary: Pull Request resolved: facebookincubator#385 Reduce unnecessary assignments of constants. Reviewed By: khabinov, morgendave, wushirong Differential Revision: D43923768 fbshipit-source-id: 1ec6869dfa01964cd4ac0c3cdd7600c604ade9d5
Summary: Pull Request resolved: facebookincubator#392 We had int_elementwise support for dynamic shapes in aten2ait, but didn't add it to fx2ait. fx2ait was able to calculate static shapes, but recently the IFR model requires dynamic shape calculation: https://fburl.com/code/7eag5h8a Therefore, this diff adds the support. Reviewed By: khabinov, wushirong Differential Revision: D43964418 fbshipit-source-id: 32e64e18e1acd1f6152b6448361fd472e4dbfe8d
…cubator#546) Summary: Pull Request resolved: facebookincubator#546 This would help to reduce the test duration Reviewed By: houseroad Differential Revision: D44782015 fbshipit-source-id: 3f4ce8d3bb07766eaef866ec19d41ba990ae5b38
…cebookincubator#548) Summary: Pull Request resolved: facebookincubator#548 This would help to reduce the test duration Reviewed By: houseroad Differential Revision: D44782077 fbshipit-source-id: bc69c15b67de543ce3c23fabf205c8050b646590
…bator#545) Summary: Pull Request resolved: facebookincubator#545 This would help to reduce the test duration Reviewed By: houseroad Differential Revision: D44781395 fbshipit-source-id: 438062d52630e86e3346a4c9ac7f8ed6bcb34d7d
…ncubator#547) Summary: Pull Request resolved: facebookincubator#547 This would help to reduce the test duration Reviewed By: khabinov, houseroad Differential Revision: D44781804 fbshipit-source-id: 02501b800663bfc5a75f4c684ec9c61ebe8ba750
…bookincubator#549) Summary: Pull Request resolved: facebookincubator#549 The `check_sequence_lengths` attribute of the `make_jagged` ops is not carried over in the `dedup_make_jagged_ops` pass from the old ops to the new one. This is a bug that causes problems in settings where `check_sequence_lengths` is set to `False` in the existing ops, as the default value is `True`. The diff fixes the bug by carrying the attribute over in the pass. Reviewed By: amateurcoffee, tissue3 Differential Revision: D44808132 fbshipit-source-id: 5bdd8d8d764cafbf06e0bacfaffe973db2aecf25
Summary: Pull Request resolved: facebookincubator#544 Previously, the same `Makefile` was used (and reused) for building all the different kinds of profilers for different ops and op configurations. This caused an issue when running parallel unit tests in a local environment, as contention for the same `./tmp/profiler/Makefile` led to different tests rewriting it before it was read by others. As a result, the tests were building each other's profilers and were left without their own. The latter manifested itself in the following error, as the profiler executable that should have been built wasn't there by the time the compilation ended:

```
Profiler ./tmp/profiler/gemm_rcr/gemm_rcr_9e46850d5286ecc7e078b5b7f76afbcac62967b4_3 is not executable
```

In this diff, the built profiler target names are included in the per-profiler `Makefile` name, hence excluding the possibility of different tests rewriting each other's profiler `Makefile`s. This resolves the issue, and the above error is no longer raised. Importantly, it is acceptable for tests to rewrite the `Makefile` of the same profiler targets, as the content will also be the same. Additionally, a few retries (with a delay) are made to check if the profiler binary is executable in the `gemm_universal` front-end. This handles the cases where the same binary is being compiled in parallel by more than one unit test, so that by the time one tries to check executability, the other may be in the process of writing the compiled result. Reviewed By: kadeng Differential Revision: D44788627 fbshipit-source-id: 3080fadb7d3114615a49b214bb4bb65abca15ef7
Summary: Pull Request resolved: facebookincubator#554 `test_make_jagged_dedup` fails in some CircleCI jobs (mostly `main`) due to a minor discrepancy:

```
AssertionError: Tensor-likes are not close!
Mismatched elements: 1 / 224 (0.4%)
Greatest absolute difference: 0.01708984375 at index (0, 16) (up to 0.01 allowed)
Greatest relative difference: 0.03127792672028597 at index (0, 16) (up to 0.01 allowed)
```

The error probably accumulates due to the two gemm ops being applied back-to-back in the test. Here we increase the tolerance to `5e-2` to avoid the test failure in CircleCI. Reviewed By: alexanderguzhva Differential Revision: D44815979 fbshipit-source-id: 02b73c45487cc5a300e04e4f131a7664bcccb6a4
Summary: Pull Request resolved: facebookincubator#553 When multiple unit tests run in parallel, a few can be building the same profiler binary (e.g., for the same op configuration from both tests). In such cases, it may happen that, by the time one test attempts to execute the built profiler binary, another test is in the middle of writing the compilation result. This triggers an error, which before this diff caused a failure of the async task running profiler commands and an eventual profiler timeout. The diff adds retries to profiler execution, hence remediating the problem described above. Reviewed By: alexanderguzhva Differential Revision: D44815907 fbshipit-source-id: c9082e8bc9c59ad1f629373e156ba4661cc89795
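A sketch of the retry logic (command, retry count, and delay are illustrative):

```python
import subprocess
import time

def run_profiler_cmd(cmd, retries=3, delay_sec=5.0):
    for attempt in range(retries):
        try:
            return subprocess.run(cmd, check=True, capture_output=True)
        except (subprocess.CalledProcessError, OSError):
            if attempt == retries - 1:
                raise              # retries exhausted: surface the real failure
            time.sleep(delay_sec)  # another test may still be writing the binary
```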
Summary: Pull Request resolved: facebookincubator#550 Currently the fuse-parallel-gemm pass doesn't check whether the tensors being fused and eliminated are output tensors. This results in errors like

```
ValueError: Output output188 was not found in the graph after optimizations.
```

during AIT compilation. This diff adds the check to make sure these aren't removed from the optimized graph. Reviewed By: frank-wei, houseroad Differential Revision: D44806086 fbshipit-source-id: a1e1f286c5377afe8464aba1cb0c5d7f83de9984
Summary: Pull Request resolved: facebookincubator#556 There is a bug in the current GEMM profiler's use of the memory pool: the tensors are requested only once for the entire GEMM kernel's profiling loop. The fact that the same tensors / memory regions / pointers are used in all iterations of the kernel's profiling loop renders the memory pool virtually useless. The risk is that small inputs may stick in the GPU's L2 cache, leading to unreliable profiling results. In this diff we fix the bug by modifying the GEMM back-end profiler templates so that the `memory_pool->RequestTensorByIdx(...)` calls are made *within* the profiling loop, hence rotating the inputs for every call and eschewing L2 caching. Experiments with simple GEMM on small problem sizes (e.g., `M=1024, N=512, K=256`) have shown that, after the fix, the runtimes measured in profiling can grow by up to 30% for some of the kernels. The selected best kernel can also change as a result. Reviewed By: tenpercent Differential Revision: D44816867 fbshipit-source-id: 27259671614422cbe3072d578842b5bc617dc830
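Schematically, the fix moves the tensor requests inside the timed loop (Python pseudocode; the real profiler templates are C++ and call `memory_pool->RequestTensorByIdx(...)`):

```python
def profile_gemm(memory_pool, gemm_kernel, num_iters):
    # Before the fix, the three tensors were requested once, outside this
    # loop, so every iteration reused the same pointers and small inputs
    # could stay resident in L2 between timed runs.
    for _ in range(num_iters):
        a = memory_pool.request_tensor_by_idx(0)  # fresh region per call
        b = memory_pool.request_tensor_by_idx(1)
        c = memory_pool.request_tensor_by_idx(2)
        gemm_kernel(a, b, c)  # rotated inputs defeat stale L2 hits
```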
Summary: Pull Request resolved: facebookincubator#560 Reviewed By: henryhu6 Differential Revision: D44854358 Pulled By: terrychenism fbshipit-source-id: a80e704f35aea69ba57c1b0d7bf1785312aa88bf
Summary: Pull Request resolved: facebookincubator#551 Reviewed By: tenpercent Differential Revision: D44814768 fbshipit-source-id: 71184eeb0c95bafbd853ea4685e2135423c7df8b
…vistor (facebookincubator#552) Summary: Pull Request resolved: facebookincubator#552 cutlass::gemm::GemmCoord uses int values as coordinates under the hood, while AIT might use int64_t variables in the {M, N, K} constructor, so a narrowing conversion is needed. Reviewed By: tenpercent Differential Revision: D44814784 fbshipit-source-id: 521fb91570fea19c4a651e71ea93e2e0c787eb48
…kincubator#530) Summary: Pull Request resolved: facebookincubator#530 ATT. Also updated b2b bmm kernels to support alpha1_divide_by_seq_len. Reviewed By: aakhundov, kadeng Differential Revision: D44451037 fbshipit-source-id: dc104bed4edff38d99d2117815d700b516a50c73
Summary: Pull Request resolved: facebookincubator#563 The `reduce_*` ops seem to fail [this assertion](https://github.com/facebookincubator/AITemplate/blob/main/python/aitemplate/backend/cuda/reduce/reduce_small_axis.py#L316) when the last input dimension is an `IntVar`. The problem seems to be that the reduction axis is assumed to be -1 in the `_get_read_vector_type` function, even if it's actually not. Hence the check [here](https://github.com/facebookincubator/AITemplate/blob/main/python/aitemplate/backend/cuda/reduce/reduce_small_axis.py#L413) against the actual reduction axis passes, but the subsequent aforementioned assertion fails. This diff replaces the assertion by using the `input_type` as the `read_vector_type` if the last input dim is an `IntVar`, as the `IntVar` reduction dim's value can be odd at runtime. Instead of failing the assertion, the code compilation completes successfully. Reviewed By: chenyang78 Differential Revision: D44915126 fbshipit-source-id: 34a8d9b8f0b678468ed1e80f4ae56b34aafc1c5e
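The shape of the fix, much simplified (names and the width table are illustrative, not the actual `reduce_small_axis.py` internals):

```python
def choose_read_vector_type(last_dim_is_dynamic, last_dim_value, input_type):
    vector_types = [(8, "uint4"), (4, "uint2"), (2, "uint")]  # elems -> type
    if last_dim_is_dynamic:
        return input_type  # runtime value may be odd: scalar reads are safe
    for width, vec_type in vector_types:
        if last_dim_value % width == 0:
            return vec_type  # widest vector that evenly divides the dim
    return input_type
```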
Summary: Pull Request resolved: facebookincubator#541 See T148695911. With D44229622 we could prove that it should be possible to speed up unit tests, and therefore also CI runs, considerably. The task was to integrate the build cache with Sandcastle CI in order to speed up our CI process. For reference about the considered options, tradeoffs, and decision process: original design doc at https://docs.google.com/document/d/1GHuhIJ83CsS3hgB8bV53TDTIqavqpPl4guP_kDcWdII/edit Final design review meeting slides & notes: https://docs.google.com/presentation/d/1bICc-OtCp1kgisL3SOCN7XYN4ZRn9a6JX62eMjFUI68/edit#slide=id.g1e0053f1f88_0_53

Implementation:
- [x] Created a Manifold-based build cache implementation
- [x] Incorporated it into the non-OSS part of the codebase, similar to fb/detect_model.py, in fb/build_cache.py
- [x] Sets TTL on stored objects. Resets this TTL on read (asynchronously, no need to wait for this before continuing)
- [x] Archiving and storing of files to be cached happen asynchronously in order not to delay the tests
- [x] Investigated whether we can get Manifold latency down by creating a new bucket with different settings (did not work for me)

Added features and config options to:
- [x] Disable caching for a compile_model call, an entire unit test, or globally (env var)
- [x] Disable the build cache for profiling only (env var)
- Not use the cache with a certain probability (in order to keep the build system and cache under test)
- [x] Incorporated info from a question on the Manifold Users Workplace group on whether we can use the official Manifold Client for this use case (https://fb.workplace.com/groups/ManifoldUsers/permalink/1682913152123392/). (Unless we quickly get an answer, the first implementation should use the deprecated Manifold client, because that is proven to work and safe in multiprocessing.)
- [x] Does not cache .obj files (unnecessary, and takes up a large amount of storage in many cases)
- [x] Added a unit test (mock Manifold client)

Reviewed By: ipiszy, aakhundov Differential Revision: D44642328 fbshipit-source-id: 9d2ec65e953d7f513d4325a7d1cc834f1b5afb75
…or#565) Summary: Pull Request resolved: facebookincubator#565 There were reports of corrupted CUTLASS include directories leading to build failures that could only be resolved by manually deleting a directory generated by the FBCUDA target below /tmp. This fix attempts to make the corresponding logic more robust against edge cases and errors, as well as to fail early if assertions are violated. Reviewed By: aakhundov Differential Revision: D44918599 fbshipit-source-id: e02e8f272ac8c625522c069a98a679383bbff883
Summary: Pull Request resolved: facebookincubator#562 conv1d can be expressed in terms of conv2d, so no new kernel is introduced; instead, the conv2d kernel generation is customized. Reviewed By: terrychenism Differential Revision: D44894688 fbshipit-source-id: c6e1d8894498302cf43bfe8c07ee9779b94fe3d2
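The reduction from conv1d to conv2d can be checked with PyTorch (illustrative only; the commit customizes AIT's conv2d kernel generation rather than using PyTorch):

```python
import torch
import torch.nn.functional as F

def conv1d_via_conv2d(x, weight, stride=1, padding=0):
    # x: (N, C_in, W) -> (N, C_in, 1, W); weight: (C_out, C_in, K) -> (C_out, C_in, 1, K)
    y = F.conv2d(
        x.unsqueeze(2),
        weight.unsqueeze(2),
        stride=(1, stride),
        padding=(0, padding),
    )
    return y.squeeze(2)  # drop the dummy height dimension

x = torch.randn(2, 3, 16)
w = torch.randn(4, 3, 5)
assert torch.allclose(conv1d_via_conv2d(x, w), F.conv1d(x, w), atol=1e-5)
```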
Summary: Pull Request resolved: facebookincubator#566 Refactoring the "arange" tensor used in time embeddings to be a model parameter. Reviewed By: henryhu6 Differential Revision: D44903108 fbshipit-source-id: 227a2d4d2fee126dab02393af71ba35bef82936d
…ator#570) Summary: Consider the following graph:

    concat_0 = concatenate(x0, x0)
    reshape_1 = reshape(concat_0)
    concat_2 = concatenate(reshape_1, x1)
    concat_3 = concatenate(concat_0, x2)

Previously, our move_view_ops pass would end up in an infinite loop, because it kept turning the graph into forms that were always valid for another iteration, e.g. (1) after the first iteration:

    concat_0 = concatenate(x0, x0)
    concat_2 = concatenate(concat_0, x1)
    new_reshape = reshape(concat_2)
    concat_3 = concatenate(new_reshape, x2)

(2) after the second iteration:

    concat_0 = concatenate(x0, x0)
    new_reshape = reshape(concat_0)
    concat_2 = concatenate(new_reshape, x1)
    concat_3 = concatenate(concat_0, x2)

and so on. This PR fixes the issue by skipping the pattern. Pull Request resolved: facebookincubator#570 Reviewed By: hl475 Differential Revision: D44946922 Pulled By: chenyang78 fbshipit-source-id: ff91fef90218feb4679e5b073979a8de02d912a8
Summary: Pull Request resolved: facebookincubator#516 Symbolic shape support has landed; remove the hacks that were used. Reviewed By: tissue3 Differential Revision: D44482705 fbshipit-source-id: 685c74efa0b4a2cec6a2f963fff4b0437b44a32e
…acebookincubator#559) Summary: Pull Request resolved: facebookincubator#559 The `_fuse_strided_op_and_cat` pass inside `transform_strided_ops` shouldn't fuse GEMM and concat if the concatenation happens along a dimension >= the rank of the original shape. This happens, for example, when a GEMM output of shape `(M, N)` is unsqueezed to `(M, N, 1)` and concatenated with another `(M, N, 1)`. Such fusion would require the GEMM to write the last dimension into memory in a non-contiguous way, which is not supported for row-major output (only one stride is supported). However, fusion is possible when the unsqueezed dimension is internal, e.g. when the final shape is `(M, 1, N)`. The method `TensorAccessor.is_rightmost_dim_contiguous` checks whether fusion is possible based on these criteria. Reviewed By: tissue3, aakhundov Differential Revision: D44747795 fbshipit-source-id: 4fbb005ce27d32654bda68f8405ec06b23f17a1a
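A much-simplified illustration of the criterion (the real check is `TensorAccessor.is_rightmost_dim_contiguous`, which handles general view chains):

```python
def can_fuse_gemm_and_cat(original_rank: int, cat_dim: int) -> bool:
    # Concatenating along a dimension at or past the rank of the original
    # (pre-view) GEMM output forces non-contiguous writes of the innermost
    # dim, which row-major output does not support.
    return cat_dim < original_rank

assert can_fuse_gemm_and_cat(2, 1)      # (M, N), cat on dim 1: fusable
assert not can_fuse_gemm_and_cat(2, 2)  # (M, N) -> (M, N, 1), cat on dim 2: not fusable
```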