pku-liang / amos
Automatic Mapping Generation, Verification, and Exploration for ISA-based Spatial Accelerators
License: Apache License 2.0
When running the example in the README with my RTX 3060 GPU, I keep getting ".E.E.E.E.E" build-error outputs. The iterations never find a valid solution, so the output keeps repeating lines like "Iteration 2: 1e-10/1e-10(10000000000000.0 ms)". Is there something I can do about that? The same thing happens if I run the test_winograd_cuda file, which gives output like this:
Problem size:
1 256 56 56 512 3 3 1 1
Logging to devnull...
Totally 1 different mappings for this matching
Logging to conv2d-fp16-layer-6-batch-1.log...
Loading from file conv2d-fp16-layer-6-batch-1.log...
Load 0 entries! The best known is 10000000000000.000000 ms
Using arch: sm_86
Total search tirals: 1000
batch size: 16
batch num: 63
Search round: 0
.E.E.E.E.E.E.E.E.E.E.E.E.E.E.E.E
Current best timecost: 10000000000000.0 ms
Search round: 1
.E.E.E.E.E.E.E.E.E.E.E.E.E.E.E.E
Current best timecost: 10000000000000.0 ms
Search round: 2
.E.E.E.E.E.E.E.E.E.E.E.E.E.E.E.E
Current best timecost: 10000000000000.0 ms
Search round: 3
.E.E.E.E.E.E.E.E.E.E.E.E.E.E.E.E
Current best timecost: 10000000000000.0 ms
Search round: 4
.E.E.E.E.E.E.E.E.E.E.E.E.E.E.E.E
Current best timecost: 10000000000000.0 ms
Search round: 5
.E.E.E.E.E.E.E.E.E.E.E.E.E.E.E.E
Current best timecost: 10000000000000.0 ms
Search round: 6
.E.E.E.E.E.E.E.E.E.E.E.E.E.E.E.E
Current best timecost: 10000000000000.0 ms
Search round: 7
.E.E.E.E.E.E.E.E.E.E.E.E.E.E.E.E
Current best timecost: 10000000000000.0 ms
Search round: 8
.E.E.E.E.E.E.E.E.E.E.E.E.E.E.E.E
Current best timecost: 10000000000000.0 ms
Search round: 9
.E.E.E.E.E.E.E.E.E.E.E.E.E.E.E.E
Current best timecost: 10000000000000.0 ms
Search round: 10
.E.E.E.E.E.E.E.E.E.E.E.E.E.E.E.E
Current best timecost: 10000000000000.0 ms
Search round: 11
It seems hard to find new candidates...
python-BaseException
Traceback (most recent call last):
  File "/home/vdkhvb/Documents/applications/pycharm/plugins/python-ce/helpers/pydev/pydevd.py", line 1483, in _exec
    pydev_imports.execfile(file, globals, locals)  # execute the script
  File "/home/vdkhvb/Documents/applications/pycharm/plugins/python-ce/helpers/pydev/_pydev_imps/_pydev_execfile.py", line 18, in execfile
    exec(compile(contents+"\n", file, 'exec'), glob, loc)
  File "/home/vdkhvb/Documents/git/AMOS/tests/python/auto_tensorize/test_winograd_cuda.py", line 254, in <module>
    cost = run(
  File "/home/vdkhvb/Documents/git/AMOS/tests/python/auto_tensorize/test_winograd_cuda.py", line 190, in run
    return tensorize_tensorcore_fp16fp16(
  File "/home/vdkhvb/Documents/git/AMOS/tests/python/auto_tensorize/test_winograd_cuda.py", line 166, in tensorize_tensorcore_fp16fp16
    result = at.auto_tensorize_v2(
  File "/home/vdkhvb/Documents/git/AMOS/python/tvm/auto_tensorize/auto_tensorize.py", line 365, in auto_tensorize_v2
    value, params = find_optimized_parameters(
  File "/home/vdkhvb/Documents/git/AMOS/python/tvm/auto_tensorize/search/parameter.py", line 428, in find_optimized_parameters
    params = schedule_gen.get_next(policy=policy)
  File "/home/vdkhvb/Documents/git/AMOS/python/tvm/auto_tensorize/search/parameter.py", line 383, in get_next
    return next(self.gen)
  File "/home/vdkhvb/Documents/git/AMOS/python/tvm/auto_tensorize/search/parameter.py", line 344, in _get_next
    yield self.get(repeat=repeat)
  File "/home/vdkhvb/Documents/git/AMOS/python/tvm/auto_tensorize/search/parameter.py", line 250, in get
    return self.entries[0].record
IndexError: list index out of range
Process finished with exit code 1
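The final IndexError is a symptom rather than the root cause: every candidate build failed (the repeated ".E" markers), so the schedule generator's entry list is empty when get() indexes entries[0]. A minimal defensive guard, sketched against the method shown in the traceback, would at least fail with a clearer message. This is an illustration only, not the upstream fix; the real question is why every build fails (e.g. a CUDA/TVM toolchain issue for sm_86):

# Hypothetical guard for the get() method at
# python/tvm/auto_tensorize/search/parameter.py, line 250 in the traceback.
def get(self, repeat=False):
    if not self.entries:
        # Every measurement failed, so there is no best record to return;
        # raise a descriptive error instead of a bare IndexError.
        raise RuntimeError(
            "No feasible schedule found: all candidate builds failed "
            "(the '.E' markers above). Check the CUDA/TVM setup for sm_86."
        )
    return self.entries[0].record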
AMOS leverages automatic mapping generation and performance optimization to improve the utilization of emerging hardware units such as Tensor Cores. I have run into some implementation challenges and would appreciate guidance.
Hi all, could you explain the difference between auto_tensorize and auto_tensorize_v4? From my AMOS GEMM benchmarking, the performance of the two strategies is nearly identical:
M | K | N | amos-1000-step-fp16-simple (ms) | amos-1000-step-fp16 (ms)
---|---|---|---|---
2 | 2 | 2 | Failed to Run | Failed to Run |
4 | 4 | 4 | Failed to Run | Failed to Run |
8 | 8 | 8 | Failed to Run | Failed to Run |
16 | 16 | 16 | 0.004545906 | 0.003936828 |
32 | 32 | 32 | 0.004610093 | 0.004310548 |
64 | 64 | 64 | 0.004638971 | 0.004614832 |
128 | 128 | 128 | 0.005128772 | 0.005059945 |
256 | 256 | 256 | 0.006975747 | 0.007367229 |
512 | 512 | 512 | 0.018055338 | 0.016287096 |
1024 | 1024 | 1024 | 0.066839093 | 0.071785023 |
2048 | 2048 | 2048 | 0.382059749 | 0.336489417 |
4096 | 4096 | 4096 | 2.00519422 | 2.252330443 |
8192 | 8192 | 8192 | 21.62599663 | 18.10944683 |
16384 | 16384 | 16384 | 111.4660256 | 132.6751751 |
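For context, the two strategies are invoked through different entry points in AMOS's example script (the full script is quoted in the next issue below); this sketch only extracts the two calls with that script's default arguments, using the variable names defined there:

# simple_mode=1 path: at.auto_tensorize, 1000 trials by default
result = at.auto_tensorize(
    target_dag, target, log_file, measure_opt, trials=1000, verbose=False
)

# simple_mode=0 path: at.auto_tensorize_v4, 4000 trials by default,
# grouped search plus an optional performance model
result = at.auto_tensorize_v4(
    target_dag, target, log_file, measure_opt,
    schedule_log_dir=log_dir, trials=4000, search_group_size=5,
    transform_dump=False, enable_perf_model=False, perf_percentage=0.6,
)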
Hi there, I'm currently benchmarking GEMM performance of AMOS on Tensor Cores. I modified mapping_gemm_tensorcore as below:
import tvm
import os
from tvm import auto_tensorize as at
import argparse


def gemm(M, N, K, in_dtype, out_dtype):
    A = tvm.te.placeholder([M, K], dtype=in_dtype, name="A")
    B = tvm.te.placeholder([K, N], dtype=in_dtype, name="B")
    rk = tvm.te.reduce_axis([0, K], name="k")
    C = tvm.te.compute(
        [M, N], lambda i, j: tvm.te.sum((A[i, rk] * B[rk, j]).astype(out_dtype), axis=rk), name="C"
    )
    return [A, B, C]


def mapping_tensorcore(
    M,
    N,
    K,
    layer,
    in_dtype,
    out_dtype,
    simple_mode=True,
    trials=-1,
    verbose=False,
    use_perf_model=False,
    perf_model_ratio=0.6,
):
    A, B, Gemm = gemm(M, N, K, in_dtype, out_dtype)
    target_dag = at.compute_dag_from_tensors([Gemm])
    target = "cuda"
    log_dir = "gemm-%s-%s-layer-%s" % (in_dtype, out_dtype, layer)
    log_file = "gemm-%s-%s-layer-%s.log" % (in_dtype, out_dtype, layer)
    measure_opt = at.MeasureOptions(target=target, timeout=100, number=200, min_repeat_ms=500)

    if simple_mode:
        trials = 1000 if trials < 0 else trials
        result = at.auto_tensorize(
            target_dag, target, log_file, measure_opt, trials=trials, verbose=verbose
        )
        if not result.defined():
            print("Can't do tensorize.")
            return
        schedule_gen = result.sch_gen
        schedule_app = result.sch_app
        # load from file
        schedule_gen.load_from_file(log_file, clear=True)
        entry = schedule_gen.get_best_entry()
        # we store 1/time_cost in file
        params, value = entry.record, 1 / entry.value
        print(value)
        print(params.to_json())
    else:
        trials = 4000 if trials < 0 else trials
        result = at.auto_tensorize_v4(
            target_dag,
            target,
            log_file,
            measure_opt,
            schedule_log_dir=log_dir,
            trials=trials,
            search_group_size=5,
            transform_dump=verbose,
            enable_perf_model=use_perf_model,
            perf_percentage=perf_model_ratio,
        )
        if not result.defined():
            print("Can't do tensorize.")
            return
        schedule_gen = result.sch_gen
        schedule_app = result.sch_app
        # we store 1/time_cost in file
        params, value = result.params, result.perf
        print(value)
        print(params.to_json())

    cost = at.evaluate_params(schedule_app, params, measure_opt, dump=verbose)
    print("Cost of %s is %f ms" % (log_dir, cost))
    return cost


shapes = [(16, 512, 128), (1024, 16, 256), (256, 1024, 256), (512, 256, 16), (1024, 1024, 1024)]

supported_dtypes = set(
    [
        ("float16", "float16"),
        ("float16", "float32"),
        ("bfloat16", "float32"),
        ("float32", "float32"),
        ("float64", "float64"),
        ("int4", "int32"),
        ("int8", "int32"),
    ]
)

example_text = """
example:
    python mapping_gemm_tensorcore.py --in_dtype float16 --out_dtype float16 --begin 0 --num 1 --trials 20
    python mapping_gemm_tensorcore.py --in_dtype float16 --out_dtype float32 --begin 0 --num 1 --trials 20
    python mapping_gemm_tensorcore.py --in_dtype float32 --out_dtype float32 --begin 0 --num 1 --trials 20
    python mapping_gemm_tensorcore.py --in_dtype float16 --out_dtype float16 --begin 0 --num 1 --trials 400 --simple_mode 0
"""

if __name__ == "__main__":
    parser = argparse.ArgumentParser(
        prog="base_maker",
        description="template maker",
        epilog=example_text,
        formatter_class=argparse.RawDescriptionHelpFormatter,
    )
    parser.add_argument(
        "--in_dtype",
        type=str,
        choices=["float16", "float32", "float64", "bfloat16", "int4", "int8"],
        default="float16",
    )
    parser.add_argument(
        "--out_dtype",
        type=str,
        choices=["float16", "float32", "float64", "int32"],
        default="float16",
    )
    parser.add_argument("--begin", type=int, choices=list(range(len(shapes))), default=0)
    parser.add_argument(
        "--num", type=int, choices=list(range(1, len(shapes) + 1)), default=len(shapes)
    )
    parser.add_argument("--simple_mode", type=int, default=1, choices=[0, 1])
    parser.add_argument("--trials", type=int, default=-1)
    parser.add_argument("--verbose", action="store_true")
    parser.add_argument("--use_perf_model", action="store_true")
    parser.add_argument("--perf_model_ratio", type=float, default=0.6)

    args = parser.parse_args()
    assert 0 < args.perf_model_ratio <= 1.0
    if args.use_perf_model:
        assert args.simple_mode == 0, "Performance model is only supported without simple_mode"
    beg = args.begin
    num = args.num
    print(args.simple_mode)
    assert (
        args.in_dtype,
        args.out_dtype,
    ) in supported_dtypes, (
        f"The desired dtype pair {(args.in_dtype, args.out_dtype)} is not supported by Tensor Core."
    )

    costs = []
    for i, shape in enumerate(shapes[beg : beg + num]):
        (M, N, K) = shape
        print("\n\nProblem size:")
        print(M, N, K)
        layer_name = f"({M}, {N}, {K})"
        try:
            cost = mapping_tensorcore(
                M,
                N,
                K,
                layer_name,
                args.in_dtype,
                args.out_dtype,
                simple_mode=args.simple_mode,
                trials=args.trials,
                verbose=args.verbose,
                use_perf_model=args.use_perf_model,
                perf_model_ratio=args.perf_model_ratio,
            )
            costs.append(cost)
        except Exception as e:
            print("Fail to run\n", str(e))
            costs.append(float("inf"))
    for cost in costs:
        print(cost)
with only the shapes customized, but this benchmark throws a "Fail to run" error:
0
Problem size:
2 2 2
Possible matchings:
0 : MatchResult(hw_abs_dag:wmma_fp16_fp16, compute:nnn, shape:16x16x16)
1 : MatchResult(hw_abs_dag:wmma_fp16_fp16, compute:nnn, shape:32x8x16)
2 : MatchResult(hw_abs_dag:wmma_fp16_fp16, compute:nnn, shape:8x32x16)
3 : MatchResult(hw_abs_dag:wmma_fp16_fp16, compute:ntn, shape:16x16x16)
4 : MatchResult(hw_abs_dag:wmma_fp16_fp16, compute:ntn, shape:32x8x16)
5 : MatchResult(hw_abs_dag:wmma_fp16_fp16, compute:ntn, shape:8x32x16)
6 : MatchResult(hw_abs_dag:wmma_fp16_fp16, compute:tnn, shape:16x16x16)
7 : MatchResult(hw_abs_dag:wmma_fp16_fp16, compute:tnn, shape:32x8x16)
8 : MatchResult(hw_abs_dag:wmma_fp16_fp16, compute:tnn, shape:8x32x16)
9 : MatchResult(hw_abs_dag:wmma_fp16_fp16, compute:ttn, shape:16x16x16)
10 : MatchResult(hw_abs_dag:wmma_fp16_fp16, compute:ttn, shape:32x8x16)
11 : MatchResult(hw_abs_dag:wmma_fp16_fp16, compute:ttn, shape:8x32x16)
Logging to devnull...
Totally 1 different mappings for this matching
Logging to devnull...
Totally 1 different mappings for this matching
Catch an infeasible mapping:
{"vmap": [[1], -1]}
Fail to run
Problem size:
4 4 4
Possible matchings:
0 : MatchResult(hw_abs_dag:wmma_fp16_fp16, compute:nnn, shape:16x16x16)
1 : MatchResult(hw_abs_dag:wmma_fp16_fp16, compute:nnn, shape:32x8x16)
2 : MatchResult(hw_abs_dag:wmma_fp16_fp16, compute:nnn, shape:8x32x16)
3 : MatchResult(hw_abs_dag:wmma_fp16_fp16, compute:ntn, shape:16x16x16)
4 : MatchResult(hw_abs_dag:wmma_fp16_fp16, compute:ntn, shape:32x8x16)
5 : MatchResult(hw_abs_dag:wmma_fp16_fp16, compute:ntn, shape:8x32x16)
6 : MatchResult(hw_abs_dag:wmma_fp16_fp16, compute:tnn, shape:16x16x16)
7 : MatchResult(hw_abs_dag:wmma_fp16_fp16, compute:tnn, shape:32x8x16)
8 : MatchResult(hw_abs_dag:wmma_fp16_fp16, compute:tnn, shape:8x32x16)
9 : MatchResult(hw_abs_dag:wmma_fp16_fp16, compute:ttn, shape:16x16x16)
10 : MatchResult(hw_abs_dag:wmma_fp16_fp16, compute:ttn, shape:32x8x16)
11 : MatchResult(hw_abs_dag:wmma_fp16_fp16, compute:ttn, shape:8x32x16)
Logging to devnull...
Totally 1 different mappings for this matching
Logging to devnull...
Totally 1 different mappings for this matching
Catch an infeasible mapping:
{"vmap": [[1], -1]}
Fail to run
Problem size:
8 8 8
Possible matchings:
0 : MatchResult(hw_abs_dag:wmma_fp16_fp16, compute:nnn, shape:16x16x16)
1 : MatchResult(hw_abs_dag:wmma_fp16_fp16, compute:nnn, shape:32x8x16)
2 : MatchResult(hw_abs_dag:wmma_fp16_fp16, compute:nnn, shape:8x32x16)
3 : MatchResult(hw_abs_dag:wmma_fp16_fp16, compute:ntn, shape:16x16x16)
4 : MatchResult(hw_abs_dag:wmma_fp16_fp16, compute:ntn, shape:32x8x16)
5 : MatchResult(hw_abs_dag:wmma_fp16_fp16, compute:ntn, shape:8x32x16)
6 : MatchResult(hw_abs_dag:wmma_fp16_fp16, compute:tnn, shape:16x16x16)
7 : MatchResult(hw_abs_dag:wmma_fp16_fp16, compute:tnn, shape:32x8x16)
8 : MatchResult(hw_abs_dag:wmma_fp16_fp16, compute:tnn, shape:8x32x16)
9 : MatchResult(hw_abs_dag:wmma_fp16_fp16, compute:ttn, shape:16x16x16)
10 : MatchResult(hw_abs_dag:wmma_fp16_fp16, compute:ttn, shape:32x8x16)
11 : MatchResult(hw_abs_dag:wmma_fp16_fp16, compute:ttn, shape:8x32x16)
Logging to devnull...
Totally 1 different mappings for this matching
Logging to devnull...
Totally 1 different mappings for this matching
Catch an infeasible mapping:
{"vmap": [[1], -1]}
Fail to run
but (16, 16, 16) works fine. Any suggestions?
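A plausible explanation is that sizes 2, 4, and 8 are smaller than every WMMA intrinsic shape listed in the matchings (16x16x16, 32x8x16, 8x32x16), so no mapping can tile the compute, hence the infeasible {"vmap": [[1], -1]}. One workaround, sketched here as an assumption rather than an AMOS-endorsed recipe, is to zero-pad the GEMM up to multiples of 16 before building the compute DAG; the helper below is hypothetical and would replace gemm() in the script above:

import tvm
from tvm import te

def padded_gemm(M, N, K, in_dtype, out_dtype, tile=16):
    # Hypothetical helper (not part of AMOS): round each dimension up to a
    # multiple of the WMMA tile so a 16x16x16-style intrinsic can cover it.
    def up(x):
        return ((x + tile - 1) // tile) * tile

    Mp, Np, Kp = up(M), up(N), up(K)
    A = te.placeholder([M, K], dtype=in_dtype, name="A")
    B = te.placeholder([K, N], dtype=in_dtype, name="B")
    # Zero-pad both operands; the extra zeros do not change the result.
    Ap = te.compute(
        [Mp, Kp],
        lambda i, k: tvm.tir.if_then_else(
            tvm.tir.all(i < M, k < K), A[i, k], tvm.tir.const(0, in_dtype)
        ),
        name="Ap",
    )
    Bp = te.compute(
        [Kp, Np],
        lambda k, j: tvm.tir.if_then_else(
            tvm.tir.all(k < K, j < N), B[k, j], tvm.tir.const(0, in_dtype)
        ),
        name="Bp",
    )
    rk = te.reduce_axis([0, Kp], name="k")
    Cp = te.compute(
        [Mp, Np],
        lambda i, j: te.sum((Ap[i, rk] * Bp[rk, j]).astype(out_dtype), axis=rk),
        name="Cp",
    )
    # Slice the valid MxN region back out of the padded result.
    C = te.compute([M, N], lambda i, j: Cp[i, j], name="C")
    return [A, B, C]

At these tiny sizes the kernel is launch-bound anyway, so even a successful mapping would mostly be measuring overhead.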
Hi, thanks for sharing your excellent work! I'm trying to perform auto-tensorization on x86 CPUs. Is there a tutorial on how to tune for this platform? I tried to modify the script that was originally designed for conv2d on Tensor Cores, but it finds no valid solution; all I get is "LLVM ERROR: Do not know how to split the result of this operator!". I also ran the 6th test case of tests/python/auto_tensorize/test_auto_schedule.py, and it works correctly, so I think the problem is not in the code-generation procedure. Environment: Intel(R) Xeon(R) Gold 6314U CPU, LLVM 8.0.0, gcc 9.3.0.
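That error string comes from LLVM's type legalizer, and LLVM 8.0.0 is fairly old, so upgrading LLVM is worth trying before changing the script further. For reference, the retargeting described above amounts to roughly the following sketch (assumptions: the -mcpu string is illustrative, and whether AMOS ships a matching CPU hw_abs_dag for this recipe is not confirmed here):

# Hypothetical CPU retarget of the Tensor Core mapping script above.
target = "llvm -mcpu=icelake-server"  # assumed flag for a Xeon Gold 6314U; adjust as needed
measure_opt = at.MeasureOptions(target=target, timeout=100, number=200, min_repeat_ms=500)
result = at.auto_tensorize(target_dag, target, log_file, measure_opt, trials=1000, verbose=False)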