pku-liang / amos

Automatic Mapping Generation, Verification, and Exploration for ISA-based Spatial Accelerators

License: Apache License 2.0

CMake 0.55% Makefile 0.20% Java 0.82% Shell 0.63% C++ 36.12% RenderScript 0.01% Python 57.27% C 0.92% Objective-C 0.08% Objective-C++ 0.23% Rust 1.34% Go 0.46% Cuda 0.79% HTML 0.01% JavaScript 0.06% TypeScript 0.39% Cython 0.12%

amos's Issues

Support for RTX 3060

When running the example in the README with my RTX 3060 GPU, I keep getting ".E.E.E.E.E" build-error outputs. The iterations never find a proper solution, so the output keeps repeating lines like "Iteration 2: 1e-10/1e-10(10000000000000.0 ms)". Is there something I can do about that? The same thing happens if I run the test_winograd_cuda file, which gives output like this:

Problem size:
1 256 56 56 512 3 3 1 1
Logging to devnull...
Totally 1 different mappings for this matching
Logging to conv2d-fp16-layer-6-batch-1.log...
Loading from file conv2d-fp16-layer-6-batch-1.log...
Load 0 entries! The best known is 10000000000000.000000 ms
Using arch: sm_86
Total search tirals: 1000 
batch size: 16 
batch num: 63
Search round: 0
.E.E.E.E.E.E.E.E.E.E.E.E.E.E.E.E

Current best timecost:  10000000000000.0 ms
Search round: 1
.E.E.E.E.E.E.E.E.E.E.E.E.E.E.E.E

Current best timecost:  10000000000000.0 ms
Search round: 2
.E.E.E.E.E.E.E.E.E.E.E.E.E.E.E.E

Current best timecost:  10000000000000.0 ms
Search round: 3
.E.E.E.E.E.E.E.E.E.E.E.E.E.E.E.E

Current best timecost:  10000000000000.0 ms
Search round: 4
.E.E.E.E.E.E.E.E.E.E.E.E.E.E.E.E

Current best timecost:  10000000000000.0 ms
Search round: 5
.E.E.E.E.E.E.E.E.E.E.E.E.E.E.E.E

Current best timecost:  10000000000000.0 ms
Search round: 6
.E.E.E.E.E.E.E.E.E.E.E.E.E.E.E.E

Current best timecost:  10000000000000.0 ms
Search round: 7
.E.E.E.E.E.E.E.E.E.E.E.E.E.E.E.E

Current best timecost:  10000000000000.0 ms
Search round: 8
.E.E.E.E.E.E.E.E.E.E.E.E.E.E.E.E

Current best timecost:  10000000000000.0 ms
Search round: 9
.E.E.E.E.E.E.E.E.E.E.E.E.E.E.E.E

Current best timecost:  10000000000000.0 ms
Search round: 10
.E.E.E.E.E.E.E.E.E.E.E.E.E.E.E.E

Current best timecost:  10000000000000.0 ms
Search round: 11
It seems hard to find new candidates...
python-BaseException
Traceback (most recent call last):
  File "/home/vdkhvb/Documents/applications/pycharm/plugins/python-ce/helpers/pydev/pydevd.py", line 1483, in _exec
    pydev_imports.execfile(file, globals, locals)  # execute the script
  File "/home/vdkhvb/Documents/applications/pycharm/plugins/python-ce/helpers/pydev/_pydev_imps/_pydev_execfile.py", line 18, in execfile
    exec(compile(contents+"\n", file, 'exec'), glob, loc)
  File "/home/vdkhvb/Documents/git/AMOS/tests/python/auto_tensorize/test_winograd_cuda.py", line 254, in <module>
    cost = run(
  File "/home/vdkhvb/Documents/git/AMOS/tests/python/auto_tensorize/test_winograd_cuda.py", line 190, in run
    return tensorize_tensorcore_fp16fp16(
  File "/home/vdkhvb/Documents/git/AMOS/tests/python/auto_tensorize/test_winograd_cuda.py", line 166, in tensorize_tensorcore_fp16fp16
    result = at.auto_tensorize_v2(
  File "/home/vdkhvb/Documents/git/AMOS/python/tvm/auto_tensorize/auto_tensorize.py", line 365, in auto_tensorize_v2
    value, params = find_optimized_parameters(
  File "/home/vdkhvb/Documents/git/AMOS/python/tvm/auto_tensorize/search/parameter.py", line 428, in find_optimized_parameters
    params = schedule_gen.get_next(policy=policy)
  File "/home/vdkhvb/Documents/git/AMOS/python/tvm/auto_tensorize/search/parameter.py", line 383, in get_next
    return next(self.gen)
  File "/home/vdkhvb/Documents/git/AMOS/python/tvm/auto_tensorize/search/parameter.py", line 344, in _get_next
    yield self.get(repeat=repeat)
  File "/home/vdkhvb/Documents/git/AMOS/python/tvm/auto_tensorize/search/parameter.py", line 250, in get
    return self.entries[0].record
IndexError: list index out of range

Process finished with exit code 1
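
Not an answer, just a minimal diagnostic sketch: the repeated ".E" markers mean every measured candidate failed to build or run, and the IndexError then follows because no entry was ever recorded. Before anything else it may be worth confirming that this TVM/AMOS installation can compile and launch a trivial kernel for the local GPU at all. The snippet below uses only standard TVM calls, nothing AMOS-specific; if even this fails, the problem is in the CUDA toolchain or driver rather than in the mapping search.

import numpy as np
import tvm
from tvm import te

# Check that the GPU is visible and report its compute capability
# (an RTX 3060 should report "8.6").
dev = tvm.cuda(0)
print("GPU found:", dev.exist, "compute capability:", dev.compute_version)

# Build and run a trivial element-wise kernel for the plain CUDA target,
# independent of AMOS and of Tensor Cores.
n = 1024
A = te.placeholder((n,), dtype="float32", name="A")
B = te.compute((n,), lambda i: A[i] + 1.0, name="B")
s = te.create_schedule(B.op)
bx, tx = s[B].split(B.op.axis[0], factor=128)
s[B].bind(bx, te.thread_axis("blockIdx.x"))
s[B].bind(tx, te.thread_axis("threadIdx.x"))
func = tvm.build(s, [A, B], target="cuda")

a = tvm.nd.array(np.ones(n, dtype="float32"), dev)
b = tvm.nd.array(np.zeros(n, dtype="float32"), dev)
func(a, b)
print("Trivial CUDA build and run OK")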

Some Qs about implementation

AMOS takes an innovative approach, leveraging automatic mapping generation and performance optimization to improve the utilization of emerging hardware units such as Tensor Cores. I have run into some implementation questions and would appreciate guidance on them.

  1. When computing the compute latency, the intrinsic latency is a fixed value that can be approximated with a hardware model, and it is then multiplied by the trip counts of the sequential loops, i.e., the loops that are not bound to parallel cores (see the sketch after this list). Why is this serialization necessary?
  2. Scheduling operations such as tiling and fusion typically happen before tensorization and produce parallel code, and different schedules can change the number of software iterations. How should this variation be handled, and how effective is the mapping generation process in practice?
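
For concreteness, the estimate in question 1 can be written as a tiny formula. The sketch below is my own illustration under the assumptions stated in the comments, not AMOS's actual cost model; the intrinsic latency and trip counts are made-up numbers.

import math

def estimated_compute_latency(intrinsic_latency, sequential_trip_counts):
    # Assumption: the intrinsic latency is a fixed value taken from a hardware
    # model, and only loops that are NOT bound to parallel cores (the
    # sequential loops) multiply it; parallel/blocked loops are excluded.
    return intrinsic_latency * math.prod(sequential_trip_counts)

# Example with made-up numbers: an intrinsic of 8 cycles wrapped in two
# sequential loops of 4 and 8 iterations -> 8 * 4 * 8 = 256 cycles.
print(estimated_compute_latency(8, [4, 8]))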

Performance difference with simple_mode enabled?

Hi all, could you kindly explain the difference between auto_tensorize and auto_tensorize_v4? From our amos-gemm benchmarking, the performance of the two strategies looks quite similar.

M      K      N      amos-1000-step-fp16-simple (ms)   amos-1000-step-fp16 (ms)
2      2      2      Failed to Run                     Failed to Run
4      4      4      Failed to Run                     Failed to Run
8      8      8      Failed to Run                     Failed to Run
16     16     16     0.004545906                       0.003936828
32     32     32     0.004610093                       0.004310548
64     64     64     0.004638971                       0.004614832
128    128    128    0.005128772                       0.005059945
256    256    256    0.006975747                       0.007367229
512    512    512    0.018055338                       0.016287096
1024   1024   1024   0.066839093                       0.071785023
2048   2048   2048   0.382059749                       0.336489417
4096   4096   4096   2.00519422                        2.252330443
8192   8192   8192   21.62599663                       18.10944683
16384  16384  16384  111.4660256                       132.6751751

Fail to run gemm_bench with Problem size (2,2,2) (4,4,4) (8,8,8)

Hi there, I'm currently benchmarking GEMM performance of AMOS on Tensor Cores. I modified mapping_gemm_tensorcore as below:

import tvm
import os
from tvm import auto_tensorize as at
import argparse


def gemm(M, N, K, in_dtype, out_dtype):
    A = tvm.te.placeholder([M, K], dtype=in_dtype, name="A")
    B = tvm.te.placeholder([K, N], dtype=in_dtype, name="B")

    rk = tvm.te.reduce_axis([0, K], name="k")
    C = tvm.te.compute(
        [M, N], lambda i, j: tvm.te.sum((A[i, rk] * B[rk, j]).astype(out_dtype), axis=rk), name="C"
    )
    return [A, B, C]


def mapping_tensorcore(
    M,
    N,
    K,
    layer,
    in_dtype,
    out_dtype,
    simple_mode=True,
    trials=-1,
    verbose=False,
    use_perf_model=False,
    perf_model_ratio=0.6,
):
    A, B, Gemm = gemm(M, N, K, in_dtype, out_dtype)
    target_dag = at.compute_dag_from_tensors([Gemm])
    target = "cuda"

    log_dir = "gemm-%s-%s-layer-%s" % (in_dtype, out_dtype, layer)
    log_file = "gemm-%s-%s-layer-%s.log" % (in_dtype, out_dtype, layer)

    measure_opt = at.MeasureOptions(target=target, timeout=100, number=200, min_repeat_ms=500)

    if simple_mode:
        trials = 1000 if trials < 0 else trials
        result = at.auto_tensorize(
            target_dag, target, log_file, measure_opt, trials=trials, verbose=verbose
        )
        if not result.defined():
            print("Can't do tensorize.")
            return
        schedule_gen = result.sch_gen
        schedule_app = result.sch_app

        # load from file
        schedule_gen.load_from_file(log_file, clear=True)
        entry = schedule_gen.get_best_entry()
        # we store 1/time_cost in file
        params, value = entry.record, 1 / entry.value
        print(value)
        print(params.to_json())
    else:
        trials = 4000 if trials < 0 else trials
        result = at.auto_tensorize_v4(
            target_dag,
            target,
            log_file,
            measure_opt,
            schedule_log_dir=log_dir,
            trials=trials,
            search_group_size=5,
            transform_dump=verbose,
            enable_perf_model=use_perf_model,
            perf_percentage=perf_model_ratio,
        )
        if not result.defined():
            print("Can't do tensorize.")
            return
        schedule_gen = result.sch_gen
        schedule_app = result.sch_app

        # we store 1/time_cost in file
        params, value = result.params, result.perf
        print(value)
        print(params.to_json())

    cost = at.evaluate_params(schedule_app, params, measure_opt, dump=verbose)
    print("Cost of %s is %f ms" % (log_dir, cost))
    return cost


shapes = [(16, 512, 128), (1024, 16, 256), (256, 1024, 256), (512, 256, 16), (1024, 1024, 1024)]

supported_dtypes = set(
    [
        ("float16", "float16"),
        ("float16", "float32"),
        ("bfloat16", "float32"),
        ("float32", "float32"),
        ("float64", "float64"),
        ("int4", "int32"),
        ("int8", "int32"),
    ]
)

example_text = """
 example:
    python mapping_gemm_tensorcore.py --in_dtype float16 --out_dtype float16 --begin 0 --num 1 --trials 20
    python mapping_gemm_tensorcore.py --in_dtype float16 --out_dtype float32 --begin 0 --num 1 --trials 20
    python mapping_gemm_tensorcore.py --in_dtype float32 --out_dtype float32 --begin 0 --num 1 --trials 20
    python mapping_gemm_tensorcore.py --in_dtype float16 --out_dtype float16 --begin 0 --num 1 --trials 400 --simple_mode 0
"""

if __name__ == "__main__":
    parser = argparse.ArgumentParser(
        prog="base_maker",
        description="template maker",
        epilog=example_text,
        formatter_class=argparse.RawDescriptionHelpFormatter,
    )
    parser.add_argument(
        "--in_dtype",
        type=str,
        choices=["float16", "float32", "float64", "bfloat16", "int4", "int8"],
        default="float16",
    )
    parser.add_argument(
        "--out_dtype",
        type=str,
        choices=["float16", "float32", "float64", "int32"],
        default="float16",
    )
    parser.add_argument("--begin", type=int, choices=list(range(len(shapes))), default=0)
    parser.add_argument(
        "--num", type=int, choices=list(range(1, len(shapes) + 1)), default=len(shapes)
    )
    parser.add_argument("--simple_mode", type=int, default=1, choices=[0, 1])
    parser.add_argument("--trials", type=int, default=-1)
    parser.add_argument("--verbose", action="store_true")
    parser.add_argument("--use_perf_model", action="store_true")
    parser.add_argument("--perf_model_ratio", type=float, default=0.6)

    args = parser.parse_args()
    assert 0 < args.perf_model_ratio <= 1.0
    if args.use_perf_model:
        assert args.simple_mode == 0, "Performance model is only supported without simple_mode"
    beg = args.begin
    num = args.num
    print(args.simple_mode)
    assert (
        args.in_dtype,
        args.out_dtype,
    ) in supported_dtypes, (
        f"The desired dtype pair {(args.in_dtype, args.out_dtype)} is not supported by Tensor Core."
    )
    costs = []
    for i, shape in enumerate(shapes[beg : beg + num]):
        (M, N, K) = shape
        print("\n\nProblem size:")
        print(M, N, K)
        layer_name = f"({M}, {N}, {K})"
        try:
            cost = mapping_tensorcore(
                M,
                N,
                K,
                layer_name,
                args.in_dtype,
                args.out_dtype,
                simple_mode=args.simple_mode,
                trials=args.trials,
                verbose=args.verbose,
                use_perf_model=args.use_perf_model,
                perf_model_ratio=args.perf_model_ratio,
            )
            costs.append(cost)
        except Exception as e:
            print("Fail to run\n", str(e))
            costs.append(float("inf"))
    for cost in costs:
        print(cost)

The only change is the customized shapes, but the benchmark throws a "Fail to run" error:

0


Problem size:
2 2 2
Possible matchings:
0 : MatchResult(hw_abs_dag:wmma_fp16_fp16, compute:nnn, shape:16x16x16)
1 : MatchResult(hw_abs_dag:wmma_fp16_fp16, compute:nnn, shape:32x8x16)
2 : MatchResult(hw_abs_dag:wmma_fp16_fp16, compute:nnn, shape:8x32x16)
3 : MatchResult(hw_abs_dag:wmma_fp16_fp16, compute:ntn, shape:16x16x16)
4 : MatchResult(hw_abs_dag:wmma_fp16_fp16, compute:ntn, shape:32x8x16)
5 : MatchResult(hw_abs_dag:wmma_fp16_fp16, compute:ntn, shape:8x32x16)
6 : MatchResult(hw_abs_dag:wmma_fp16_fp16, compute:tnn, shape:16x16x16)
7 : MatchResult(hw_abs_dag:wmma_fp16_fp16, compute:tnn, shape:32x8x16)
8 : MatchResult(hw_abs_dag:wmma_fp16_fp16, compute:tnn, shape:8x32x16)
9 : MatchResult(hw_abs_dag:wmma_fp16_fp16, compute:ttn, shape:16x16x16)
10 : MatchResult(hw_abs_dag:wmma_fp16_fp16, compute:ttn, shape:32x8x16)
11 : MatchResult(hw_abs_dag:wmma_fp16_fp16, compute:ttn, shape:8x32x16)
Logging to devnull...
Totally 1 different mappings for this matching
Logging to devnull...
Totally 1 different mappings for this matching
Catch an infeasible mapping:
{"vmap": [[1], -1]}
Fail to run



Problem size:
4 4 4
Possible matchings:
0 : MatchResult(hw_abs_dag:wmma_fp16_fp16, compute:nnn, shape:16x16x16)
1 : MatchResult(hw_abs_dag:wmma_fp16_fp16, compute:nnn, shape:32x8x16)
2 : MatchResult(hw_abs_dag:wmma_fp16_fp16, compute:nnn, shape:8x32x16)
3 : MatchResult(hw_abs_dag:wmma_fp16_fp16, compute:ntn, shape:16x16x16)
4 : MatchResult(hw_abs_dag:wmma_fp16_fp16, compute:ntn, shape:32x8x16)
5 : MatchResult(hw_abs_dag:wmma_fp16_fp16, compute:ntn, shape:8x32x16)
6 : MatchResult(hw_abs_dag:wmma_fp16_fp16, compute:tnn, shape:16x16x16)
7 : MatchResult(hw_abs_dag:wmma_fp16_fp16, compute:tnn, shape:32x8x16)
8 : MatchResult(hw_abs_dag:wmma_fp16_fp16, compute:tnn, shape:8x32x16)
9 : MatchResult(hw_abs_dag:wmma_fp16_fp16, compute:ttn, shape:16x16x16)
10 : MatchResult(hw_abs_dag:wmma_fp16_fp16, compute:ttn, shape:32x8x16)
11 : MatchResult(hw_abs_dag:wmma_fp16_fp16, compute:ttn, shape:8x32x16)
Logging to devnull...
Totally 1 different mappings for this matching
Logging to devnull...
Totally 1 different mappings for this matching
Catch an infeasible mapping:
{"vmap": [[1], -1]}
Fail to run



Problem size:
8 8 8
Possible matchings:
0 : MatchResult(hw_abs_dag:wmma_fp16_fp16, compute:nnn, shape:16x16x16)
1 : MatchResult(hw_abs_dag:wmma_fp16_fp16, compute:nnn, shape:32x8x16)
2 : MatchResult(hw_abs_dag:wmma_fp16_fp16, compute:nnn, shape:8x32x16)
3 : MatchResult(hw_abs_dag:wmma_fp16_fp16, compute:ntn, shape:16x16x16)
4 : MatchResult(hw_abs_dag:wmma_fp16_fp16, compute:ntn, shape:32x8x16)
5 : MatchResult(hw_abs_dag:wmma_fp16_fp16, compute:ntn, shape:8x32x16)
6 : MatchResult(hw_abs_dag:wmma_fp16_fp16, compute:tnn, shape:16x16x16)
7 : MatchResult(hw_abs_dag:wmma_fp16_fp16, compute:tnn, shape:32x8x16)
8 : MatchResult(hw_abs_dag:wmma_fp16_fp16, compute:tnn, shape:8x32x16)
9 : MatchResult(hw_abs_dag:wmma_fp16_fp16, compute:ttn, shape:16x16x16)
10 : MatchResult(hw_abs_dag:wmma_fp16_fp16, compute:ttn, shape:32x8x16)
11 : MatchResult(hw_abs_dag:wmma_fp16_fp16, compute:ttn, shape:8x32x16)
Logging to devnull...
Totally 1 different mappings for this matching
Logging to devnull...
Totally 1 different mappings for this matching
Catch an infeasible mapping:
{"vmap": [[1], -1]}
Fail to run

However, (16, 16, 16) works fine. Any suggestions?
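
One possible direction, offered strictly as a sketch: the infeasible-mapping message appears only for shapes smaller than a single wmma tile (the smallest matched intrinsic above is 16x16x16), so padding M, N, and K up to multiples of 16 before building the compute DAG might allow a mapping to be found. The helper below is hypothetical, uses only standard tvm.te/tvm.tir calls, and has not been validated against AMOS's matcher.

import tvm

def padded_gemm(M, N, K, in_dtype, out_dtype, tile=16):
    # Hypothetical helper: round each dimension up to a multiple of the wmma
    # tile so a 16x16x16 intrinsic can match, padding the inputs with zeros.
    Mp = ((M + tile - 1) // tile) * tile
    Np = ((N + tile - 1) // tile) * tile
    Kp = ((K + tile - 1) // tile) * tile
    A = tvm.te.placeholder([M, K], dtype=in_dtype, name="A")
    B = tvm.te.placeholder([K, N], dtype=in_dtype, name="B")
    A_pad = tvm.te.compute(
        [Mp, Kp],
        lambda i, k: tvm.tir.if_then_else(
            tvm.tir.all(i < M, k < K), A[i, k], tvm.tir.const(0, in_dtype)
        ),
        name="A_pad",
    )
    B_pad = tvm.te.compute(
        [Kp, Np],
        lambda k, j: tvm.tir.if_then_else(
            tvm.tir.all(k < K, j < N), B[k, j], tvm.tir.const(0, in_dtype)
        ),
        name="B_pad",
    )
    rk = tvm.te.reduce_axis([0, Kp], name="k")
    C_pad = tvm.te.compute(
        [Mp, Np],
        lambda i, j: tvm.te.sum((A_pad[i, rk] * B_pad[rk, j]).astype(out_dtype), axis=rk),
        name="C_pad",
    )
    # Slice the valid M x N region back out after the padded GEMM.
    C = tvm.te.compute([M, N], lambda i, j: C_pad[i, j], name="C")
    return [A, B, C]

Whether AMOS then maps the padded DAG efficiently for such tiny problems is untested here; for sub-tile shapes a plain CUDA core kernel may simply be the better choice.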

How to tune on x86 CPUs?

Hi, thanks for sharing your excellent work! I'm trying to perform auto-tensorization on x86 CPUs. Is there a tutorial on how to tune on this platform? I tried to modify the script that was originally designed for conv2d on Tensor Cores, but it finds no valid solution; all I get is "LLVM ERROR: Do not know how to split the result of this operator!". I also ran the 6th test case of "tests/python/auto_tensorize/test_auto_schedule.py", and it works correctly, so I don't think the problem is in the code generation procedure. Environment: Intel(R) Xeon(R) Gold 6314U CPU, LLVM 8.0.0, gcc 9.3.0.
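
Not an AMOS-specific answer, but a small diagnostic sketch that may help narrow this down: the "Do not know how to split the result of this operator" message comes from LLVM's legalizer, and LLVM 8.0.0 is quite old for an Ice Lake server part, so it may help to check which LLVM TVM is actually linked against and whether a trivially vectorized kernel builds for an explicit -mcpu target. The -mcpu value below is only an example; use whatever your LLVM actually supports.

import numpy as np
import tvm
from tvm import te
from tvm.target import codegen

# Report the LLVM major version TVM was built against; old LLVM releases can
# fail to legalize vector code for newer x86 ISAs.
print("LLVM major version:", codegen.llvm_version_major())

# Explicit CPU target; "cascadelake" is an assumed example value here.
target = tvm.target.Target("llvm -mcpu=cascadelake")

# Build and run a trivially vectorized kernel for that target.
n = 1024
A = te.placeholder((n,), dtype="float32", name="A")
B = te.compute((n,), lambda i: A[i] * 2.0, name="B")
s = te.create_schedule(B.op)
xo, xi = s[B].split(B.op.axis[0], factor=16)
s[B].vectorize(xi)
func = tvm.build(s, [A, B], target=target)

dev = tvm.cpu(0)
a = tvm.nd.array(np.random.rand(n).astype("float32"), dev)
b = tvm.nd.array(np.zeros(n, dtype="float32"), dev)
func(a, b)
print("Vectorized build and run OK for", target)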
