[nnc][perf] 50x performance regression from `_jit_override_can_fuse_on_cpu(True)`

I am seeing big performance regressions on two TorchBench benchmarks when I enable `torch._C._jit_override_can_fuse_on_cpu(True)`. LearningToPaint regresses both with and without freezing, while pytorch_mobilenet_v3 regresses only when I also enable freezing (in that case a 922x regression).

Code to reproduce:

import torch, re, multiprocessing
from timeit import timeit
from torchbenchmark.models import LearningToPaint, pytorch_mobilenet_v3


def measure(fuse, benchmark_module, warmup=1, number=100):
    torch.set_num_threads(1)  # single thread for stable, comparable timings
    torch._C._jit_override_can_fuse_on_cpu(fuse)
    name = re.sub(r"^.*[.]", "", benchmark_module.__name__)
    benchmark = benchmark_module.Model(device="cpu", jit=True)
    model, example_inputs = benchmark.get_module()
    assert isinstance(model, torch.jit.ScriptModule)

    model = model.eval()
    timeit(lambda: model(*example_inputs), number=warmup)  # warmup pass
    print(f"    script({name:20})         = {timeit(lambda: model(*example_inputs), number=number):.3f} sec")

    model = torch.jit.freeze(model)
    timeit(lambda: model(*example_inputs), number=warmup)  # warmup pass
    print(f"    freeze(script({name:20})) = {timeit(lambda: model(*example_inputs), number=number):.3f} sec")


for fuse in (False, True):
    print(f"_jit_override_can_fuse_on_cpu({fuse}):")
    for benchmark_module in (LearningToPaint, pytorch_mobilenet_v3):
        # Doing a subproc to ensure we aren't running cached code
        p = multiprocessing.Process(target=measure, args=(fuse, benchmark_module))
        p.start()
        p.join()

Output:

_jit_override_can_fuse_on_cpu(False):
    script(LearningToPaint     )         = 5.862 sec
    freeze(script(LearningToPaint     )) = 5.131 sec
    script(pytorch_mobilenet_v3)         = 1.089 sec
    freeze(script(pytorch_mobilenet_v3)) = 0.460 sec
_jit_override_can_fuse_on_cpu(True):
    script(LearningToPaint     )         = 275.655 sec
    freeze(script(LearningToPaint     )) = 274.557 sec
    script(pytorch_mobilenet_v3)         = 1.108 sec
    freeze(script(pytorch_mobilenet_v3)) = 424.520 sec

I am running PyTorch at commit 96fd5d87f7c739fd62b57b6cb08fd41a60d0bf40 on Ubuntu 20.10 with gcc 9.3.

cc @gmagogsfm

Comments:

  1. For the current behavior, I’d suggest that `_jit_override_can_fuse_on_cpu(True)` throw an error if you don’t have LLVM. Once fusing becomes the default, maybe we need to switch to a NO_LLVM flag. (See the first sketch after this thread.)

  2. I had a chat with @bertmaher and his suggestion might work. We have a flag, te_must_use_llvm_on_cpu, which is off by default and is used in TE to enforce the LLVM backend. The plan is to turn this flag on by default, which would ensure that we error out when CPU fusion is enabled without LLVM; we can then turn the flag off explicitly for all tests. Trying this now. (See the second sketch after this thread.)

    What are they testing if LLVM is disabled?

    It is still useful to test without LLVM, since that exercises the parts of our code base that don’t depend on LLVM codegen. However, I agree that these tests should be run with LLVM enabled as well.

    If we switch to NO_LLVM, how would we do (2)?

    NO_LLVM would be in addition to USE_LLVM; the purpose of USE_LLVM would then be just to point at the LLVM build.

    Anyway, I realize that changing the default build options is a much higher-impact change. If we decide to go that route, it probably has to be done with proper messaging.

  3. Can we enable LLVM in CI?

    I believe we have one or several builds with LLVM enabled in CI. The majority of builds are done without LLVM, though.

    Yeah, this is correct; it’s on for some but not all builds. I think in the current state, where the LLVM fuser is off by default, it makes sense to have some builds covered each way, mainly to catch build errors that could creep in. I don’t think we should care about testing the TE interpreter from Python; the interpreter mainly exists to support IR simplification.
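
To make the suggestion in comment 1 concrete, here is a minimal sketch of such a guard, assuming a Python-level probe like torch._C._llvm_enabled() (the exact binding name is an assumption and may differ across builds):

import torch

def enable_cpu_fusion():
    # Sketch of the suggested behavior, not current PyTorch semantics:
    # refuse to enable CPU fusion when the TE LLVM backend is not compiled
    # in, instead of silently falling back to the very slow TE interpreter.
    # `_llvm_enabled` is assumed here; the probe may vary between versions.
    if not torch._C._llvm_enabled():
        raise RuntimeError(
            "CPU fusion requested, but this PyTorch build has no LLVM codegen"
        )
    torch._C._jit_override_can_fuse_on_cpu(True)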
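
And a sketch of the plan from comment 2, assuming the flag is exposed to Python as _jit_set_te_must_use_llvm_cpu (again an assumption; the binding name may differ):

import torch

# With te_must_use_llvm_on_cpu on by default, enabling CPU fusion on a
# build without LLVM would error out rather than run fused kernels
# through the TE interpreter.
torch._C._jit_set_te_must_use_llvm_cpu(True)

# Tests that exercise the non-codegen parts of TE on LLVM-less builds
# would opt out explicitly.
torch._C._jit_set_te_must_use_llvm_cpu(False)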