Merge upstream ci #66

Open

wants to merge 468 commits into base: amd-develop
Conversation

fsx950223

No description provided.

apivovarov and others added 30 commits March 2, 2023 15:50
Summary:
Add `AIT_TIME_COMPILATION` description to [env.rst](https://facebookincubator.github.io/AITemplate/reference/env.html)

Follow-up change for facebookincubator#347

Pull Request resolved: facebookincubator#356

Reviewed By: alexanderguzhva

Differential Revision: D43752096

Pulled By: tenpercent

fbshipit-source-id: f521248d661b87ce82d954a40a17a333ffbcc5b2
Summary:
Fixes:
- fix typo `"output1": out0_ait` -> `"output1": out1_ait`
- the `outputs` array is created based on `len(input_name_to_idx)`, but it should use `len(output_name_to_idx)` instead
- simplify the code that creates the fixed-size array
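
A minimal sketch of the corrected pattern (the variable and dict names follow the commit message; this is illustrative, not the exact example code):

```python
# Size the outputs array by the number of outputs, not inputs,
# and map each named output to its index.
outputs = [None] * len(output_name_to_idx)   # was: len(input_name_to_idx)
outputs[output_name_to_idx["output0"]] = out0_ait
outputs[output_name_to_idx["output1"]] = out1_ait  # was: out0_ait (typo)
```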

Pull Request resolved: facebookincubator#352

Reviewed By: tenpercent, houseroad

Differential Revision: D43756996

Pulled By: muchulee8

fbshipit-source-id: 9e8dcea10d9a22c161a2a497823725d653e67350
Summary:
Pull Request resolved: facebookincubator#358

as titled

Reviewed By: chenyang78

Differential Revision: D43762214

fbshipit-source-id: 4f5457ac90f53e98ffcf4bdf3b0b88459c8f2d4b
…cebookincubator#360)

Summary:
In some constrained dev environments, installing pycuda is not feasible.

This PR first checks whether pycuda is available. If not, we fall back to the old approach to detecting the target.
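
A minimal sketch of that check, assuming a hypothetical `detect_target_legacy()` fallback (the real function names may differ):

```python
try:
    import pycuda.driver  # probe whether pycuda is installed

    HAS_PYCUDA = True
except ImportError:
    HAS_PYCUDA = False


def detect_target():
    if HAS_PYCUDA:
        return detect_target_with_pycuda()  # hypothetical pycuda-based path
    return detect_target_legacy()           # fall back to the old approach
```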

Pull Request resolved: facebookincubator#360

Reviewed By: tenpercent

Differential Revision: D43775402

Pulled By: chenyang78

fbshipit-source-id: d9ab747110e41cba8f993ede83a6794db34bfb66
Summary:
Pull Request resolved: facebookincubator#312

### Implement expand operator CUDA backend

Adding a CUDA backend implementation for expand: https://fburl.com/code/nb2mcsmg.

The operator semantics should be the same as the PyTorch version: https://fburl.com/fljywh6p.

#### Implementation

The previous expand operator was a no-op version, which only worked under very limited conditions, namely when it expanded just a single, already existing dimension and could be merged into a following elementwise op that supports tensor broadcasting.

This new version actually expands the tensor, supporting multiple expansion dimensions, dynamic shapes, and adding dimensions, just like the PyTorch version. There are three CUDA kernels implemented: one dealing with the general case, and two specialized to be faster in certain scenarios.

The PyTorch version is in principle more efficient, however, because it only needs to create a new view on the source tensor with different read strides. As AIT has no general notion of strides for tensor dimensions, this is not a real option at the moment, unless we add that support to tensors and the operators on them.
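
For reference, these are the PyTorch semantics being matched: `expand` returns a view whose expanded dimensions have stride 0, so no data is copied:

```python
import torch

x = torch.randn(3, 1)
y = x.expand(3, 4)                   # view of size (3, 4), no copy
print(y.stride())                    # (1, 0): the expanded dim has stride 0
z = x.unsqueeze(0).expand(2, 3, 4)   # adding a leading dim also works
print(z.shape)                       # torch.Size([2, 3, 4])
```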

#### Further possible optimizations (not part of this PR )

 * When adding leading dimensions, the operation can be decomposed into writing an upper part of the tensor (requiring strided reads or writes) and then repeatedly copying that tensor (which can be accomplished using efficient sequential reads and writes and can utilize shared memory)
 * Further operator fusions should be possible
 * With all immutable dimensions, a more efficient implementation would be possible via loop unrolling, precalculation of strides, etc.

Reviewed By: chenyang78

Differential Revision: D43419041

fbshipit-source-id: 84ec2c4716c3e21860d1d55807cf649ed543ba2e
Summary:
This PR enables bmm_ccr/bmm_rrr and concat fusion. It also cleans up the relevant unit tests a bit.

Pull Request resolved: facebookincubator#359

Reviewed By: tenpercent

Differential Revision: D43775333

Pulled By: chenyang78

fbshipit-source-id: 7ce94b00066f7f5142388eee397d6959cde183e0
Summary: Pull Request resolved: facebookincubator#362

Reviewed By: alexanderguzhva

Differential Revision: D43813375

Pulled By: chenyang78

fbshipit-source-id: d9c65bf2b15e6362343b6d4e77a510853fad5613
Summary:
This PR enables dynamic h/w for conv2d and d/h/w for conv3d. The profiling strategy is not optimal, as we only profile with the max d/h/w values. We will implement a better strategy (e.g. bucketing) later.

We also removed duplicate code in conv3d_bias.

Pull Request resolved: facebookincubator#363

Reviewed By: terrychenism

Differential Revision: D43821796

Pulled By: chenyang78

fbshipit-source-id: 8f91b9193becf1727b704573a9bdca5a036d8b8d
Summary:
Pull Request resolved: facebookincubator#357

If there is more than one most frequent dimension in the input shapes, the leftmost one, i.e. the one with the lowest position score (the sum of its position indices across the shapes), is picked as the batch size.

If there are multiple most frequent dimensions with the same position score, the choice is still arbitrary.
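
A minimal sketch of this heuristic, with hypothetical helper names (not the actual AITemplate implementation):

```python
from collections import Counter


def pick_batch_dim(shapes):
    # Count how often each dimension value occurs across all input shapes,
    # and accumulate its position score (sum of position indices).
    freq = Counter()
    pos_score = Counter()
    for shape in shapes:
        for idx, dim in enumerate(shape):
            freq[dim] += 1
            pos_score[dim] += idx
    best = max(freq.values())
    candidates = [d for d, c in freq.items() if c == best]
    # Tie-break on the lowest position score (the "leftmost" dimension).
    return min(candidates, key=lambda d: pos_score[d])


print(pick_batch_dim([(8, 16), (8, 32), (16, 8)]))  # -> 8
```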

Reviewed By: wushirong

Differential Revision: D43755669

fbshipit-source-id: a8c10bbd2977e953ce44a22b0ee2df8e7c976963
Summary:
Pull Request resolved: facebookincubator#366

Cannot compare int with NoneType, so we need to check for None before comparing.
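
The general Python pattern, shown on hypothetical values:

```python
# Comparing an int with None raises:
#   TypeError: '<' not supported between instances of 'int' and 'NoneType'
def fits(value, upper_bound):
    # Guard against None before the comparison.
    return upper_bound is None or value < upper_bound
```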

Reviewed By: aakhundov

Differential Revision: D43815532

fbshipit-source-id: 384561e43bd51007b6c93530e5087a110758df12
Summary:
Pull Request resolved: facebookincubator#364

Previously, `make_jagged`'s back-end relied on whether the `batch_dim` is present in any Tensor's `_attrs["shape"]` to decide if the `batch_dim` must be set (to `offsets.lengths[0] - 1`) or validated (to be equal to that).

This is problematic for cases where the `batch_dim` is present in a Tensor shape in the downstream graph, and hence is supposed to be set by `make_jagged` instead of being validated. One such case arises in the `jagged_to_dense` op, where the output dense Tensor's first `batch_dim` dimension is not known to the runtime until the input jagged Tensor is "unwrapped". In this case, `make_jagged` must assign the `batch_dim` present inside the jagged Tensor's `JaggedIntVar`, instead of validating it, so that it has a value by the time the output dense Tensor with the `batch_dim` in its `_attrs["shape"]` is processed further.

To mitigate this, in this diff the condition `make_jagged` uses to set vs. validate the `batch_dim` is changed to whether the `batch_dim` is equal to zero at runtime. Being equal to zero means that the `batch_dim` has not yet been initialized (dynamic dimensions are set to zero on declaration in the runtime), which, in turn, means it must be set. If the `batch_dim` is not equal to zero, it has already been set, and hence must be validated.
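
A hedged sketch of the new runtime condition (illustrative variable names, not the generated code):

```python
def set_or_validate_batch_dim(batch_dim_value, offsets_length):
    expected = offsets_length - 1  # i.e., offsets.lengths[0] - 1
    if batch_dim_value == 0:
        # Dynamic dims start at zero, so this one is uninitialized: set it.
        return expected
    # Already set earlier in the graph: validate instead.
    assert batch_dim_value == expected, "batch_dim mismatch"
    return batch_dim_value
```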

Reviewed By: ipiszy

Differential Revision: D43712183

fbshipit-source-id: a4570729bf6ebfba21034330b68c0362f685c72c
Summary: Pull Request resolved: facebookincubator#371

Reviewed By: wushirong

Differential Revision: D43859425

fbshipit-source-id: ed4e568d44a81c52769bf0c43ec775c4ddc88503
Reviewed By: frank-wei

Differential Revision: D43677477

fbshipit-source-id: b916f43bc7170de8bfcfb5468941d7dc82f26524
Summary:
Pull Request resolved: facebookincubator#377

Dropout is a no-op at inference. Removed with the acc tracer.

Reviewed By: frank-wei, wushirong

Differential Revision: D43881227

fbshipit-source-id: 0246365e6facc6dfb13843fa9854802f35c0938a
Summary:
If the target resolution is not divisible by 64, the compilation process fails at the UNet step.
This PR asserts on the width and height immediately, instead of compiling the CLIP model and failing a few minutes in.
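
A minimal sketch of such an early check, with hypothetical argument names:

```python
def validate_resolution(width, height):
    # Fail fast, before any model is compiled.
    assert width % 64 == 0 and height % 64 == 0, (
        f"Resolution {width}x{height} must be divisible by 64"
    )
```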

closes facebookincubator#345

Pull Request resolved: facebookincubator#355

Reviewed By: tenpercent

Differential Revision: D43784017

Pulled By: muchulee8

fbshipit-source-id: 7ab7581f80c4e649e1afa4a22b53da3aac959c13
Summary:
Updated the softmax wiki link and added images so that the wiki has links to refer to.

Pull Request resolved: facebookincubator#379

Reviewed By: muchulee8

Differential Revision: D43890078

Pulled By: tissue3

fbshipit-source-id: 5893e904c14b684b16fe8601419cba74bf0d50d7
Summary: Pull Request resolved: facebookincubator#372

Reviewed By: jiaqizhai, khabinov

Differential Revision: D43726142

fbshipit-source-id: 90add0c73e9b7725a4a0969fd3ba14ae81d3e481
Summary:
Pull Request resolved: facebookincubator#374

A fallback for the half2 input type version of fast_tanh is implemented for the CUDA_ARCH < 75 case.

Reviewed By: aakhundov

Differential Revision: D43871666

fbshipit-source-id: 5e9bed21996eb9cd5e71fdb3851e7ab9d20826cb
Summary:
Pull Request resolved: facebookincubator#370

Turns out, `JaggedIntVar` wasn't hashable. This created problems for some passes (e.g., [here](https://github.com/facebookincubator/AITemplate/blob/75f54510d8e02114e013200a66ea9a5d433e5f81/python/aitemplate/compiler/transform/transform_strided_op_and_view_op.py#L44-L48)).

This diff adds a `__hash__` function to `JaggedIntVar`. And because it should pretend to be a regular `IntVar` by default, the new `__hash__` function has the structure of `IntVar.__hash__`.
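
A short plain-Python illustration of why this matters: defining `__eq__` without `__hash__` makes a class unhashable, so passes that key sets or dicts on such objects fail. The class and hash body below are stand-ins, not the actual AITemplate code:

```python
class Dim:  # stand-in for IntVar / JaggedIntVar
    def __init__(self, name, values):
        self.name, self.values = name, values

    def __eq__(self, other):
        return (self.name, self.values) == (other.name, other.values)

    # Defining __eq__ alone sets __hash__ to None (unhashable). Restoring a
    # hash consistent with __eq__ is what the diff does for JaggedIntVar.
    def __hash__(self):
        return hash((self.name, tuple(self.values)))


seen = {Dim("batch", [1, 8])}  # works only because __hash__ is defined
```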

Reviewed By: ipiszy

Differential Revision: D43857198

fbshipit-source-id: dc569e02731ae07aa522ad06d45d4b2f8893d336
Summary:
Pull Request resolved: facebookincubator#380

In this diff, the `jagged_to_dense` front-end and back-end op is added with vectorized I/O. We reuse many of the utilities in `testing/jagged_utils.py`, similar to `backend/common/elementwise_common.py` in D43482363. A unit test and a benchmark are included.

## Implementation Details
Since the output is dense, we base the index calculations on the dense shape and apply padding whenever the current element falls outside of the input's jagged shape.
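
A NumPy sketch of that padding rule for a simple 2D case (a batch of variable-length rows described by `offsets`; the names are illustrative):

```python
import numpy as np


def jagged_to_dense(values, offsets, max_len, padding=0.0):
    batch = len(offsets) - 1
    dense = np.full((batch, max_len), padding, dtype=values.dtype)
    for b in range(batch):
        start, end = offsets[b], offsets[b + 1]
        # Elements inside the jagged shape are copied; the rest stay padded.
        dense[b, : end - start] = values[start:end]
    return dense


vals = np.arange(5, dtype=np.float32)       # rows: [0, 1, 2] and [3, 4]
print(jagged_to_dense(vals, [0, 3, 5], 4))  # [[0. 1. 2. 0.], [3. 4. 0. 0.]]
```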

Reviewed By: aakhundov

Differential Revision: D43562375

fbshipit-source-id: 930ad6793a9c6260497847330abd0a83e5861ac9
Summary:
Pull Request resolved: facebookincubator#368

Cleaned up code for expand op:

* Added more documentation comments & type hints
* Improved variable & function naming
* Simplified code (eliminated potentially unnecessarily specialized kernels)

Reviewed By: chenyang78

Differential Revision: D43844913

fbshipit-source-id: 3734e1b47d108398d5e1513e301a193e54839dc9
Summary:
Pull Request resolved: facebookincubator#354

Applied a one-off refactoring script to change all relative imports within AITemplate to absolute imports, then ran arc lint to make sure formatting is correct.

### Why?

IDEs like VSCode or PyCharm have problems resolving the paths to packages imported via relative imports, as they don't know the basepath. Now we can navigate to all imported symbols using CMD+click on the symbol.

Here is the script. It is intended for one-off usage, so I did not bother with code style or reusability.

```
import os
import re
from pathlib import Path

def process_relative_imports(path, basepath, basepackage):
    path = os.path.abspath(str(path))
    basepath = os.path.abspath(str(basepath))
    if not path.startswith(basepath):
        return
    # Path of the module relative to the package root, split into parts.
    relpath = path[len(basepath) :].strip("/")
    pparts = relpath.split("/")

    def dot_replacer(match):
        # Each leading dot walks one level up from the module's directory.
        dots = match.group(2)
        pkg = basepackage + ".".join(pparts[: -len(dots)])
        pkg = pkg.strip(".")
        replacement = match.group(1) + pkg + "." + match.group(3)
        return replacement.replace("..", ".")

    with open(path, "rt", encoding="utf8") as f:
        contents = f.read()
        # Rewrite "from .[..]module import x" into an absolute import.
        rcontents = re.sub(
            r"(^from )(\.+)([^\.].*import .*$)",
            dot_replacer,
            contents,
            flags=re.MULTILINE,
        )
    with open(path, "wt", encoding="utf8") as f:
        f.write(rcontents)
    print(f"Wrote {path}")

if __name__ == "__main__":
    allpyfiles = [str(path) for path in Path(".").rglob("*.py")]
    for p in allpyfiles:
        print(p)
        if p.endswith("extra_cutlass_generator.py"):
            continue
        process_relative_imports(p, ".", "aitemplate.")

```

Reviewed By: ipiszy, chenyang78, tenpercent

Differential Revision: D43715713

fbshipit-source-id: 1c2eaaaadc2f1edf8f4e378bc2781c5f851e80ba
Summary:
Pull Request resolved: facebookincubator#382

ATT

Reviewed By: alexanderguzhva

Differential Revision: D43920370

fbshipit-source-id: b387815948bff5b8791069c37683df8f3ff7273b
…tor#383)

Summary:
Pull Request resolved: facebookincubator#383

ATT

Reviewed By: wushirong

Differential Revision: D43923044

fbshipit-source-id: 77a21ddf9a1a11180f9bde2b132dca43964e2a88
…00 to V100 (facebookincubator#384)

Summary:
Pull Request resolved: facebookincubator#384

ATT

Reviewed By: wushirong

Differential Revision: D43924250

fbshipit-source-id: 7b438ccc420d99352855b0e69088184df075afe2
…#381)

Summary:
Pull Request resolved: facebookincubator#381

Setting the AIT_PLOT_SHORTEN_TENSOR_NAMES=1 environment variable makes AITemplate replace long tensor names with shortened ones (like a URL shortener does) when building a plot.
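
A hedged sketch of one way such shortening can work (deterministic, hash-based names; not necessarily the actual scheme):

```python
import hashlib


def shorten_name(name, max_len=24):
    if len(name) <= max_len:
        return name
    # Keep a recognizable prefix plus a stable short hash of the full name.
    digest = hashlib.sha1(name.encode("utf8")).hexdigest()[:8]
    return f"{name[: max_len - 9]}_{digest}"


print(shorten_name("elementwise_123_fused_with_gemm_rcr_bias_relu_456"))
```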

Reviewed By: chenyang78

Differential Revision: D43918759

fbshipit-source-id: d820dfae8cbfdd5c9e0ac750709736a17a94ceeb
…graph in a third-party python code. (facebookincubator#388)

Summary:
Pull Request resolved: facebookincubator#388

`_graph.json` files will be generated in addition to `_graph.txt` files under the same circumstances. Such a file can be loaded using `json.loads()` call.
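
For example (the file name here is hypothetical):

```python
import json

with open("model_graph.json", "rt", encoding="utf8") as f:
    graph = json.loads(f.read())  # or equivalently: json.load(f)

print(type(graph))  # parsed Python objects, ready for programmatic use
```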

Reviewed By: chenyang78

Differential Revision: D43951586

fbshipit-source-id: 392ee5b43f4746f428a1d92ba2bcc5ab4cbf11bb
Summary:
Pull Request resolved: facebookincubator#389

att

Reviewed By: chenyang78

Differential Revision: D43953467

fbshipit-source-id: 61dc27f91210bdf0984f6b0ba3645bd9daeed819
Summary:
Pull Request resolved: facebookincubator#385

Reduce assignments of constants that are not necessary.

Reviewed By: khabinov, morgendave, wushirong

Differential Revision: D43923768

fbshipit-source-id: 1ec6869dfa01964cd4ac0c3cdd7600c604ade9d5
Summary:
Pull Request resolved: facebookincubator#392

We had int_elementwise support for dynamic shapes in aten2ait, but didn't add it to fx2ait. fx2ait was able to calculate static shapes, but recently the IFR model requested dynamic shape calculation: https://fburl.com/code/7eag5h8a

Therefore, this diff adds the support.

Reviewed By: khabinov, wushirong

Differential Revision: D43964418

fbshipit-source-id: 32e64e18e1acd1f6152b6448361fd472e4dbfe8d
hl475 and others added 30 commits April 7, 2023 10:52
…cubator#546)

Summary:
Pull Request resolved: facebookincubator#546

This would help to reduce the test duration

Reviewed By: houseroad

Differential Revision: D44782015

fbshipit-source-id: 3f4ce8d3bb07766eaef866ec19d41ba990ae5b38
…cebookincubator#548)

Summary:
Pull Request resolved: facebookincubator#548

This would help to reduce the test duration

Reviewed By: houseroad

Differential Revision: D44782077

fbshipit-source-id: bc69c15b67de543ce3c23fabf205c8050b646590
…bator#545)

Summary:
Pull Request resolved: facebookincubator#545

This would help to reduce the test duration

Reviewed By: houseroad

Differential Revision: D44781395

fbshipit-source-id: 438062d52630e86e3346a4c9ac7f8ed6bcb34d7d
…ncubator#547)

Summary:
Pull Request resolved: facebookincubator#547

This would help to reduce the test duration

Reviewed By: khabinov, houseroad

Differential Revision: D44781804

fbshipit-source-id: 02501b800663bfc5a75f4c684ec9c61ebe8ba750
…bookincubator#549)

Summary:
Pull Request resolved: facebookincubator#549

The `check_sequence_lengths` attribute of the `make_jagged` ops is not carried over by `dedup_make_jagged_ops` from the old ops to the new one. This is a bug that causes problems when `check_sequence_lengths` is set to `False` in the existing ops, as the default value is `True`. The diff fixes the bug by carrying over the attribute in the pass.

Reviewed By: amateurcoffee, tissue3

Differential Revision: D44808132

fbshipit-source-id: 5bdd8d8d764cafbf06e0bacfaffe973db2aecf25
Summary:
Pull Request resolved: facebookincubator#544

Previously, the same `Makefile` was used (and reused) for building all the different kinds of profilers for different ops and op configurations. This caused an issue when running parallel unit tests in a local environment, as contention for the same `./tmp/profiler/Makefile` led to different tests rewriting it before it was read by others.

As a result, the tests were building each other's profilers and were left without their own. This manifested itself in the following error, as the profiler executable that should have been built wasn't there by the time the compilation ended:

```
Profiler ./tmp/profiler/gemm_rcr/gemm_rcr_9e46850d5286ecc7e078b5b7f76afbcac62967b4_3 is not executable
```

In this diff, the built profiler target names are included in the per-profiler `Makefile` name, hence excluding the possibility of different tests rewriting each other's profiler `Makefile`s. This resolves the issue, and the above error is no longer raised. Importantly, it is acceptable for tests to rewrite the `Makefile` of the same profiler targets, as the content will also be the same.

Additionally, a few retries (with a delay) are made when checking whether the profiler binary is executable in the `gemm_universal` front-end. This handles the cases where the same binary is compiled in parallel by more than one unit test, such that by the time one test checks executability, the other is still in the process of writing the compiled result.

Reviewed By: kadeng

Differential Revision: D44788627

fbshipit-source-id: 3080fadb7d3114615a49b214bb4bb65abca15ef7
Summary:
Pull Request resolved: facebookincubator#554

`test_make_jagged_dedup` fails in some CircleCI jobs (mostly `main`), due to a minor discrepancy:

```
AssertionError: Tensor-likes are not close!

Mismatched elements: 1 / 224 (0.4%)
Greatest absolute difference: 0.01708984375 at index (0, 16) (up to 0.01 allowed)
Greatest relative difference: 0.03127792672028597 at index (0, 16) (up to 0.01 allowed)
```

The error probably accumulates due to the two gemm ops being applied back-to-back in the test. Here we increase the tolerance to `5e-2` to avoid the test failure in CircleCI.
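
The failure message is in `torch.testing.assert_close` format; a usage sketch of the relaxed tolerance (the tensor values are hypothetical):

```python
import torch

ait_out = torch.randn(2, 112)
pt_out = ait_out + 1e-2 * torch.randn_like(ait_out)

# Allow up to 5e-2 absolute/relative mismatch instead of the defaults.
torch.testing.assert_close(ait_out, pt_out, atol=5e-2, rtol=5e-2)
```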

Reviewed By: alexanderguzhva

Differential Revision: D44815979

fbshipit-source-id: 02b73c45487cc5a300e04e4f131a7664bcccb6a4
Summary:
Pull Request resolved: facebookincubator#553

When multiple unit tests are running in parallel, a few can be building the same profiler binary (e.g., for the same op configuration from both tests). In such cases, it may happen that, by the time one test attempts to execute the built profiler binary, another test is in the middle of writing the compilation result. This triggered an error, which before this diff caused a failure of the async task running profiler commands, and an eventual profiler timeout.

The diff adds retries to profiler execution, hence remediating the problem described above.
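
A generic sketch of such a retry-with-delay loop (the retry count and delay are illustrative):

```python
import subprocess
import time


def run_with_retries(cmd, retries=3, delay=1.0):
    for attempt in range(retries):
        try:
            return subprocess.run(cmd, check=True, capture_output=True)
        except (subprocess.CalledProcessError, OSError):
            # The binary may still be mid-write by a parallel test;
            # wait a bit and retry before giving up.
            if attempt == retries - 1:
                raise
            time.sleep(delay)
```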

Reviewed By: alexanderguzhva

Differential Revision: D44815907

fbshipit-source-id: c9082e8bc9c59ad1f629373e156ba4661cc89795
Summary:
Pull Request resolved: facebookincubator#550

Currently the fuse parallel gemm pass doesn't check if tensors being fused and eliminated are output tensors. This results in errors like
```
"ValueError: Output output188 was not found in the graph after optimizations."
```
during AIT compilation.

This diff adds the check to make sure these tensors aren't removed from the optimized graph.

Reviewed By: frank-wei, houseroad

Differential Revision: D44806086

fbshipit-source-id: a1e1f286c5377afe8464aba1cb0c5d7f83de9984
Summary:
Pull Request resolved: facebookincubator#556

There is a bug in the current GEMM profiler's use of the memory pool: the tensors are requested only once for the entire GEMM kernel's profiling loop. Using the same tensors / memory regions / pointers in all iterations of the kernel's profiling loop renders the memory pool virtually useless. The risk is that small inputs may stick in the GPU's L2 cache, leading to unreliable profiling results.

In this diff we fix the bug by modifying the GEMM back-end profiler templates so that the `memory_pool->RequestTensorByIdx(...)` calls are made *within* the profiling loop, hence rotating the inputs on every call and eschewing L2 caching. Experiments with simple GEMMs on small problem sizes (e.g., `M=1024, N=512, K=256`) have shown that, after the fix, the runtimes measured in profiling can grow by up to 30% for some of the kernels. The selected best kernel can also change as a result.
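
A Python-flavored sketch of the structural change (the real templates are C++; `RequestTensorByIdx` mirrors the call named above, and the rest is illustrative):

```python
def profile(kernel, memory_pool, num_iters):
    times = []
    for _ in range(num_iters):
        # After the fix: request tensors INSIDE the loop, so each iteration
        # touches different memory and small inputs can't sit in L2 cache.
        a = memory_pool.RequestTensorByIdx(0)
        b = memory_pool.RequestTensorByIdx(1)
        c = memory_pool.RequestTensorByIdx(2)
        times.append(kernel.run(a, b, c))
    return min(times)
```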

Reviewed By: tenpercent

Differential Revision: D44816867

fbshipit-source-id: 27259671614422cbe3072d578842b5bc617dc830
Summary: Pull Request resolved: facebookincubator#560

Reviewed By: henryhu6

Differential Revision: D44854358

Pulled By: terrychenism

fbshipit-source-id: a80e704f35aea69ba57c1b0d7bf1785312aa88bf
Summary: Pull Request resolved: facebookincubator#551

Reviewed By: tenpercent

Differential Revision: D44814768

fbshipit-source-id: 71184eeb0c95bafbd853ea4685e2135423c7df8b
…vistor (facebookincubator#552)

Summary:
Pull Request resolved: facebookincubator#552

cutlass::gemm::GemmCoord uses int values as coordinates under the hood, while AIT might pass int64_t variables to the {M, N, K} constructor, so a narrowing conversion is needed.

Reviewed By: tenpercent

Differential Revision: D44814784

fbshipit-source-id: 521fb91570fea19c4a651e71ea93e2e0c787eb48
…kincubator#530)

Summary:
Pull Request resolved: facebookincubator#530

ATT.
Also updated b2b bmm kernels to support alpha1_divide_by_seq_len.

Reviewed By: aakhundov, kadeng

Differential Revision: D44451037

fbshipit-source-id: dc104bed4edff38d99d2117815d700b516a50c73
Summary:
Pull Request resolved: facebookincubator#563

The `reduce_*` ops seem to fail [this assertion](https://github.com/facebookincubator/AITemplate/blob/main/python/aitemplate/backend/cuda/reduce/reduce_small_axis.py#L316) when the last input dimension is `IntVar`. The problem seems to be that the reduction axis is assumed to be -1 in the `_get_read_vector_type` function, even if it actually isn't. Hence the check [here](https://github.com/facebookincubator/AITemplate/blob/main/python/aitemplate/backend/cuda/reduce/reduce_small_axis.py#L413) against the actual reduction axis passes, but the subsequent aforementioned assertion fails.

This diff replaces the assertion by using the `input_type` as the `read_vector_type` if the last input dim is `IntVar`, as the `IntVar` reduction dim's value can be odd at runtime. Instead of failing the assertion, the code compilation now completes successfully.
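
A hedged sketch of the fallback logic (the names are illustrative, not the real AITemplate internals):

```python
def choose_read_vector_type(input_type, last_dim_is_dynamic, vector_type):
    if last_dim_is_dynamic:
        # A dynamic (IntVar) innermost dim can be odd at runtime, so
        # vectorized reads can't be assumed: read one element at a time.
        return input_type
    return vector_type  # e.g. a half2/float4-style packed type


print(choose_read_vector_type("half", True, "half2"))   # half
print(choose_read_vector_type("half", False, "half2"))  # half2
```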

Reviewed By: chenyang78

Differential Revision: D44915126

fbshipit-source-id: 34a8d9b8f0b678468ed1e80f4ae56b34aafc1c5e
Summary:
Pull Request resolved: facebookincubator#541

See T148695911

With D44229622 we could prove that it should be possible to speed up unit tests and therefore also CI runs considerably.

The task was to integrate the build cache with Sandcastle CI
in order to speed up our CI process.

For reference about considered options, tradeoffs and decision process:

Original design doc at https://docs.google.com/document/d/1GHuhIJ83CsS3hgB8bV53TDTIqavqpPl4guP_kDcWdII/edit
Final design review meeting slides & notes: https://docs.google.com/presentation/d/1bICc-OtCp1kgisL3SOCN7XYN4ZRn9a6JX62eMjFUI68/edit#slide=id.g1e0053f1f88_0_53

Implementation:

 [x] Created a Manifold-based build cache implementation
 [x] Incorporated it into the non-OSS part of the codebase (in fb/build_cache.py, similar to fb/detect_model.py)
 [x] Sets TTL on stored objects; resets this TTL on read (asynchronously, no need to wait for this before continuing)
 [x] Archiving and storing of files to be cached happen asynchronously, in order not to delay the tests
 [x] Investigated whether we can get Manifold latency down by creating a new bucket with different settings (did not work for me)

 Added features and config options to:

 [x] Disable caching for a compile_model call, an entire unit test, or globally (env var)
 [x] Disable the build cache for profiling only (env var)
 Not use the cache with a certain probability (in order to keep the build system and cache under test)
 [x] Incorporated info from a question in the Manifold Users Workplace group on whether we can use the official Manifold client for this use case (https://fb.workplace.com/groups/ManifoldUsers/permalink/1682913152123392/)

 (Unless we quickly get an answer, the first implementation should use the deprecated Manifold client, because that is proven to work and is safe in multiprocessing.)

 [x] Does not cache .obj files (unnecessary, and takes up a large amount of storage in many cases)

 [x] Added unit test (mock Manifold client)

Reviewed By: ipiszy, aakhundov

Differential Revision: D44642328

fbshipit-source-id: 9d2ec65e953d7f513d4325a7d1cc834f1b5afb75
…or#565)

Summary:
Pull Request resolved: facebookincubator#565

There were reports of corrupted CUTLASS include directories leading to build failures that could only be resolved by manually deleting a directory generated by the FBCUDA target below /tmp. This fix attempts to make the corresponding logic more robust against edge cases and errors, as well as to fail early if assertions are violated.

Reviewed By: aakhundov

Differential Revision: D44918599

fbshipit-source-id: e02e8f272ac8c625522c069a98a679383bbff883
Summary:
Pull Request resolved: facebookincubator#562

conv1d can be expressed in terms of conv2d, so I didn't introduce any new kernel, but instead customized the conv2d kernel generation.
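
The underlying identity, shown with PyTorch for reference: a 1D convolution is a 2D convolution with a singleton spatial dimension:

```python
import torch
import torch.nn.functional as F

x = torch.randn(2, 4, 16)        # (N, C_in, W)
w = torch.randn(8, 4, 3)         # (C_out, C_in, K)

out1d = F.conv1d(x, w)
# Insert a height-1 dim, convolve in 2D, then squeeze it back out.
out2d = F.conv2d(x.unsqueeze(2), w.unsqueeze(2)).squeeze(2)

print(torch.allclose(out1d, out2d, atol=1e-5))  # True
```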

Reviewed By: terrychenism

Differential Revision: D44894688

fbshipit-source-id: c6e1d8894498302cf43bfe8c07ee9779b94fe3d2
Summary:
Pull Request resolved: facebookincubator#566

Refactoring "arange" tensor used in time embeddings to be model parameter.

Reviewed By: henryhu6

Differential Revision: D44903108

fbshipit-source-id: 227a2d4d2fee126dab02393af71ba35bef82936d
…ator#570)

Summary:
Consider that we have the following graph:

  concat_0 = concatenate(x0, x0)
  reshape_1 = reshape(concat_0)
  concat_2 = concat(reshape_1, x1)
  concat_3 = concatenate(concat_0, x2)

Previously, our move_view_ops pass would end up in an infinite loop, because it turned the graph into forms that were always valid for another iteration, e.g.

  (1) after the first iteration:

  concat_0 = concatenate(x0, x0)
  concat_2 = concat(concat_0, x1)
  new_reshape = reshape(concat_2)
  concat_3 = concatenate(new_reshape, x2)

  (2) after the second iteration:

  concat_0 = concatenate(x0, x0)
  new_reshape = reshape(concat_0)
  concat_2 = concat(new_reshape, x1)
  concat_3 = concatenate(concat_0, x2)

  and so on.

  This PR fixes the issue by skipping the pattern.

Pull Request resolved: facebookincubator#570

Reviewed By: hl475

Differential Revision: D44946922

Pulled By: chenyang78

fbshipit-source-id: ff91fef90218feb4679e5b073979a8de02d912a8
Summary:
Pull Request resolved: facebookincubator#516

Symbolic shape support has landed; remove the hacks that were used.

Reviewed By: tissue3

Differential Revision: D44482705

fbshipit-source-id: 685c74efa0b4a2cec6a2f963fff4b0437b44a32e
…acebookincubator#559)

Summary:
Pull Request resolved: facebookincubator#559

The `_fuse_strided_op_and_cat` pass inside `transform_strided_ops` shouldn't fuse GEMM and concat if the concatenation happens along a dimension >= the rank of the original shape. This happens, for example, when a GEMM output of shape `(M, N)` is unsqueezed to `(M, N, 1)` and concatenated with another `(M, N, 1)`. Such a fusion would require the GEMM to write the last dimension into memory in a non-contiguous way, which is not supported for row-major output (only one stride is supported).
However, fusion is possible when the unsqueezed dimension is internal, e.g. when the final shape is `(M, 1, N)`.
The method `TensorAccessor.is_rightmost_dim_contiguous` checks whether fusion is possible based on these criteria.
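
A hedged sketch of the contiguity criterion (illustrative, not the actual TensorAccessor code):

```python
def is_rightmost_dim_contiguous(orig_shape, view_shape, cat_dim):
    # The concat writes a contiguous chunk of prod(view_shape[cat_dim:])
    # elements per outer index. A row-major GEMM output can only jump
    # between whole rows (one stride), so the chunk must cover complete
    # rows of the original (M, N) output.
    chunk = 1
    for d in view_shape[cat_dim:]:
        chunk *= d
    return chunk % orig_shape[-1] == 0


# (M, N) -> (M, N, 1), concat at dim 2: chunk == 1, not a whole row.
print(is_rightmost_dim_contiguous((4, 8), (4, 8, 1), 2))  # False
# (M, N) -> (M, 1, N), concat at dim 1: chunk == N, whole rows.
print(is_rightmost_dim_contiguous((4, 8), (4, 1, 8), 1))  # True
```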

Reviewed By: tissue3, aakhundov

Differential Revision: D44747795

fbshipit-source-id: 4fbb005ce27d32654bda68f8405ec06b23f17a1a