
pku-liang / flextensor


Automatic Schedule Exploration and Optimization Framework for Tensor Computations

License: MIT License

Languages: Python 70.65%, Makefile 0.06%, Cuda 12.86%, C++ 3.63%, C 12.57%, Shell 0.22%

flextensor's People

Contributors

hatsu3, holmosaint, knowingnothing, light-of-hers, yzliu567


flextensor's Issues

optimize_block_celluar.py cannot work with '--target cuda'

Just tested 4 files: optimize_block_celluar.py, optimize_conv1d.py, optimize_conv2d.py, optimize_conv3d.py

  • optimize_block_celluar.py:
<class 'AssertionError'>
Traceback (most recent call last):
  File "../../../auto_schedule/testing/scheduler.py", line 161, in exec_func
    res = func(*args, **kwargs)
  File "../../../auto_schedule/testing/scheduler.py", line 78, in build_func
    s, bufs = schedule_with_config(task_key, configs, op_pos=op_pos)
  File "../../../auto_schedule/testing/scheduler.py", line 1711, in schedule_with_config
    template(s, op)
  File "../../../auto_schedule/testing/scheduler.py", line 1263, in _cuda_schedule_split_reorder_fuse
    assert pos < len(spatial_remainder)
AssertionError
op build fail:
(the same AssertionError traceback and "op build fail:" message repeat five more times)
<class 'RuntimeError'>
Traceback (most recent call last):
  File "../../../auto_schedule/testing/scheduler.py", line 161, in exec_func
    res = func(*args, **kwargs)
  File "../../../auto_schedule/testing/scheduler.py", line 82, in build_func
    raise RuntimeError("Invalid %s(%d) kernel"%(task.target, task.dev_id))
RuntimeError: Invalid cuda(0) kernel
op build fail:Invalid cuda(0) kernel
  • optimize_conv1d.py:
<class 'queue.Empty'>
op run fail:
Traceback (most recent call last):
  File "../../../auto_schedule/testing/scheduler.py", line 1525, in get
    res = self.q.get(block=True, timeout=timeout)
  File "/usr/lib/python3.5/multiprocessing/queues.py", line 105, in get
    raise Empty
queue.Empty
(the same queue.Empty "op run fail" trace repeats once more)
<class 'queue.Empty'>
op build fail:
Traceback (most recent call last):
  File "../../../auto_schedule/testing/scheduler.py", line 1525, in get
    res = self.q.get(block=True, timeout=timeout)
  File "/usr/lib/python3.5/multiprocessing/queues.py", line 105, in get
    raise Empty
queue.Empty
<class 'RuntimeError'>
Traceback (most recent call last):
  File "../../../auto_schedule/testing/scheduler.py", line 161, in exec_func
    res = func(*args, **kwargs)
  File "../../../auto_schedule/testing/scheduler.py", line 82, in build_func
    raise RuntimeError("Invalid %s(%d) kernel"%(task.target, task.dev_id))
RuntimeError: Invalid cuda(0) kernel
op build fail:Invalid cuda(0) kernel
  • optimize_conv3d.py:
<class 'queue.Empty'>
op run fail:
Traceback (most recent call last):
  File "../../../auto_schedule/testing/scheduler.py", line 1525, in get
    res = self.q.get(block=True, timeout=timeout)
  File "/usr/lib/python3.5/multiprocessing/queues.py", line 105, in get
    raise Empty
queue.Empty
<class 'RuntimeError'>
Traceback (most recent call last):
  File "../../../auto_schedule/testing/scheduler.py", line 161, in exec_func
    res = func(*args, **kwargs)
  File "../../../auto_schedule/testing/scheduler.py", line 82, in build_func
    raise RuntimeError("Invalid %s(%d) kernel"%(task.target, task.dev_id))
RuntimeError: Invalid cuda(0) kernel
op build fail:Invalid cuda(0) kernel
(further RuntimeError "Invalid cuda(0) kernel" and queue.Empty "op run fail" traces repeat in the same pattern)

And optimize_conv2d.py didn't output anything, even with '--target llvm'.

Where is the "kernel.c" file for the gemmini examples?

Hello,

Thanks for the amazing work and codes.

I am trying to run your examples and tests, but I have not been able to run the gemmini examples such as '/testing/others/hand-craft/gemmini-*.py'.

These examples require a "kernel.c" file, but when I looked for it, the current project does not include that file.

Can you tell me where the "kernel.c" file is? If not, could you share it?

Thanks,

problems in DQN search

I ran optimize/optimize_conv2d.py with the following command:

python3 optimize_conv2d.py --shapes yolo --target cuda --trials 1000 --timeout 10 --parallel 8 --log tmp_log.txt --method q

However, after a lot of warnings during warm up, I got this error:

Traceback (most recent call last):

  File "optimize_conv2d.py", line 204, in <module>
    logfile=flog,

  File "optimize_conv2d.py", line 116, in optimize
    rpc_info=rpc_info,

  File "/home/max/workspaces/python/FlexTensor/flextensor/scheduler.py", line 2100, in schedule
    perf_path=perf_path,

  File "/home/max/workspaces/python/FlexTensor/flextensor/scheduler.py", line 670, in schedule
    return self._q_schedule(configs, wanted_types, use_model=use_model)

  File "/home/max/workspaces/python/FlexTensor/flextensor/scheduler.py", line 443, in _q_schedule
    from_lst, next_points, action_lst = self.walker_group.walk(cur_lst, trial)

  File "/home/max/workspaces/python/FlexTensor/flextensor/model.py", line 359, in walk
    next_index_lst, direction_lst = self.walkers[name].walk(flattened_lst, index_lst, trial, epsilon, gamma)

  File "/home/max/workspaces/python/FlexTensor/flextensor/model.py", line 72, in walk
    q_values_lst = self.pre_judger(torch.FloatTensor(inputs)).detach()

  File "/home/max/.local/lib/python3.6/site-packages/torch/nn/modules/module.py", line 541, in __call__
    result = self.forward(*input, **kwargs)

  File "/home/max/workspaces/python/FlexTensor/flextensor/model.py", line 34, in forward
    out = self.net(inputs)

  File "/home/max/.local/lib/python3.6/site-packages/torch/nn/modules/module.py", line 541, in __call__
    result = self.forward(*input, **kwargs)

  File "/home/max/.local/lib/python3.6/site-packages/torch/nn/modules/container.py", line 92, in forward
    input = module(input)

  File "/home/max/.local/lib/python3.6/site-packages/torch/nn/modules/module.py", line 541, in __call__
    result = self.forward(*input, **kwargs)

  File "/home/max/.local/lib/python3.6/site-packages/torch/nn/modules/linear.py", line 87, in forward
    return F.linear(input, self.weight, self.bias)

  File "/home/max/.local/lib/python3.6/site-packages/torch/nn/functional.py", line 1372, in linear
    output = input.matmul(weight.t())

RuntimeError: size mismatch, m1: [1 x 0], m2: [22 x 64] at /pytorch/aten/src/TH/generic/THTensorMath.cpp:197

The same issue arose with another op I wrote myself when using the "method=q" setting.

Are there any instructions on using DQN search?

How to encode and schedule for "reorder"?

Hello,

Thanks for the amazing work and codes.

I looked into the paper and the code but was confused about how you encode the "reorder" schedule in the search space. I have some questions.

  1. In the code I found
    reorder_space = generate_reorder_space(groups) and groups = 3 by default
    Should I set groups equal to rlevel*slevel?

  2. The config it generates is just one-dimensional, which does not match the description in the paper, i.e.,
    "reorder": i1, j1, i2, j2, k1, i3, k2, j3, k3, i4, k4, j4 in Figure 3

Could you please explain these? Thanks.
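For what it's worth, one way to picture the one-dimensional "reorder" config is as an index into a list of candidate orders over a few coarse loop groups, rather than over every individual split axis. A minimal sketch of that reading (the helper below is hypothetical and only mirrors what generate_reorder_space(groups) might enumerate; it is not FlexTensor's actual implementation):

import itertools

def sketch_reorder_space(groups=3):
    # One candidate per permutation of `groups` coarse loop groups,
    # so the space has groups! entries (6 for the default groups=3).
    return list(itertools.permutations(range(groups)))

space = sketch_reorder_space(3)
print(len(space))  # 6
print(space[1])    # (0, 2, 1) -- a config like [[1]] would index this entry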

Some problem while running on GPU

I want to test the performance of C9 of YOLO after FlexTensor's optimization, but there seem to be some problems when running optimize_conv2d.py on the GPU:

$ python optimize_conv2d.py --shapes yolo --from 8 --to 9 --parallel 16 --target cuda
......
Warning: No valid schedule found in warm up process, please use more trials
Now automatically use more trials, increase 16
warm up [1599394505.223908] [ inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf ]
Warning: No valid schedule found in warm up process, please use more trials
Now automatically use more trials, increase 16
warm up [1599394508.009939] [ inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf ]
Warning: No valid schedule found in warm up process, please use more trials
Now automatically use more trials, increase 16
warm up [1599394510.781969] [ inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf ]
Warning: No valid schedule found in warm up process, please use more trials
Now automatically use more trials, increase 16
Fail to find valid schedule, too many errors
warm up [1599394513.576313] [ inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf ]
Warning: No valid schedule found in warm up process, please use more trials
Now automatically use more trials, increase 16
warm up [1599394516.424372] [ inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf ]
Warning: No valid schedule found in warm up process, please use more trials
Now automatically use more trials, increase 16
......

I have seen a previous issue, and the current code already uses 'spawn' for multiprocessing.
It seems the run never stops because it can't find a valid schedule.

Some problem while running on GPU

There seem to be some problems when running the second case on the GPU:

$ python3 optimize_block_circulant_matrix.py --target cuda --parallel 20 --timeout 120
Optimize block_circulant_matrix shape (1024, 256, 8)
......
######################################
op schedules:
----------------------------------
fuse [[1, 2, 2]]
spatial [[8, 1, 4, 4], [16, 1, 2, 8]]
reduce [[1, 1, 8]]
reorder [[0]]
unroll [[512, 1]]
----------------------------------
fuse [[1, 2, 2]]
spatial [[32, 1, 2, 16], [2, 4, 32, 1]]
reorder [[0]]
unroll [[1, 1]]
graph schedules:
inline [[0, 0]]
merge [[0, 1]]
block_circulant_matrix_block_circulant_matrix_(1024, 256, 8)_cuda(0):[[{"fuse": [[1, 2, 2]], "spatial": [[8, 1, 4, 4], [16, 1, 2, 8]], "reduce": [[1, 1, 8]], "reorder": [[0]], "inline": [], "unroll": [[512, 1]], "merge": []}, {"fuse": [[1, 2, 2]], "spatial": [[32, 1, 2, 16], [2, 4, 32, 1]], "reduce": [], "reorder": [[0]], "inline": [], "unroll": [[1, 1]], "merge": []}], {"fuse": [], "spatial": [], "reduce": [], "reorder": [], "inline": [[0, 0]], "unroll": [], "merge": [[0, 1]]}]
Use 0.24673579999999998 ms
Cost 1717.7187826633453 s

Optimize block_circulant_matrix shape (1024, 256, 16)
warm up [inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf]
(the same "warm up [inf, ...]" line repeats 19 more times)
Warning: No valid schedule found in warm up process, please use more trials
Now automatically use more trials, increase 20
warm up [inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf]
(this warning and automatic retry cycle repeats several more times)
.......

But if I run only the 2nd case, everything seems OK:

$ python3 optimize_block_circulant_matrix.py --target cuda --parallel 20 --timeout 120 -f 1 -t 2
Optimize block_circulant_matrix shape (1024, 256, 16)
......
######################################
op schedules:
----------------------------------
fuse [[1, 2, 2]]
spatial [[32, 1, 2, 1], [4, 1, 4, 16]]
reduce [[1, 2, 8]]
reorder [[1]]
unroll [[512, 0]]
----------------------------------
fuse [[1, 2, 2]]
spatial [[128, 4, 1, 2], [4, 4, 16, 1]]
reorder [[2]]
unroll [[512, 1]]
graph schedules:
inline [[0, 0]]
merge [[1, 0]]
block_circulant_matrix_block_circulant_matrix_(1024, 256, 16)_cuda(0):[[{"fuse": [[1, 2, 2]], "spatial": [[32, 1, 2, 1], [4, 1, 4, 16]], "reduce": [[1, 2, 8]], "reorder": [[1]], "inline": [], "unroll": [[512, 0]], "merge": []}, {"fuse": [[1, 2, 2]], "spatial": [[128, 4, 1, 2], [4, 4, 16, 1]], "reduce": [], "reorder": [[2]], "inline": [], "unroll": [[512, 1]], "merge": []}], {"fuse": [], "spatial": [], "reduce": [], "reorder": [], "inline": [[0, 0]], "unroll": [], "merge": [[1, 0]]}]
Use 0.0080564 ms
Cost 1806.2457213401794 s

And if I run cases 3~6, case 3 is OK but case 4 fails:

$ python3 optimize_block_circulant_matrix.py --target cuda --parallel 20 --timeout 120 -f 2 -t 6
Optimize block_circulant_matrix shape (1024, 512, 8)
......

Optimize block_circulant_matrix shape (1024, 512, 16)
warm up [inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf]
(the same "warm up [inf, ...]" line repeats 19 more times)
Warning: No valid schedule found in warm up process, please use more trials
Now automatically use more trials, increase 20
warm up [inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf, inf]
(this warning and automatic retry cycle repeats several more times)
......

The same problem also appears when I run some other scripts, such as 'optimize_conv1d.py' and 'optimize_gemm.py'.

Multiprocessing isolation

@KnowingNothing
I have a question regarding the call path of the parallel_evaluate routines in FlexTensor.
When tuning for an "llvm" target, it makes sense that the "parallel" option should be set to <= num cores as to have a form of isolation on the HW and thus have accurate performance measurements, where each measurement occurs on a separate core.
In the case of "cuda" target, where the candidate schedules are executed on the GPU, this isn't as straightforward.

When looking at the number of simultaneously executing schedules on the GPU, I could observe several simultaneous processes being present on the device - confirmed via inspection through nvidia-smi - during a single measurement trial. This is somewhat concerning as it causes multiple processes to potentially compete for GPU resources at the same time during measurement, effectively causing the measurement to be inaccurate as the candidates can be scheduled in a different order internally by the nvidia's cuda driver scheduler.

FlexTensor seems to reuse some of the tvm's RPC pipeline for evaluating modules, namely "func.time_evaluator" which goes through the RPCTimeEvaluator module, in basically the same fashion as it happens in AutoTVM, with the main difference being that in AutoTVM, the LocalRunner (locally spawned RPC server & tracker) keeps the "parallel" option hardcoded to 1 as to maintain the isolation between 2 separate candidate schedules being measured on the device. I thought its maybe something to do with the torch.multiprocessing module, however it seems to be a simple wrapper over Python's multiprocessing which allows to share a portion of memory between several processes and on face-value, has nothing to do with isolation when it comes to GPU execution.

For "cuda" targets, would you recommend to stick with parallel = 1 to ensure process isolation or maybe I'm missing something obvious that ensures the isolation is maintained despite multiple processes executing on the device simultaneously?
I've tested this with AutoTVM and Ansor and they both seem to isolate each candidate and as such, you can only see a single process being registered in nvidia-smi at one time.

Any chance you could clarify the above?

Many thanks! :)
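(For readers hitting the same question: until this is clarified, the conservative setup is to serialize GPU measurements yourself. A sketch, assuming the schedule() keyword arguments shown in the tutorial issue further down this page, with task_key being an already-registered task:)

from flextensor.scheduler import schedule

# Keep only one candidate on the GPU at a time so measurements are not
# perturbed by concurrent kernels, mirroring AutoTVM's LocalRunner which
# hardcodes a single measurement process.
s, bufs, configs = schedule(
    task_key,
    op_trial=100,
    timeout=10,
    method="searching",
    parallel=1,  # serialize device measurements for isolation
)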

Problem for importing flextensor

Hi! I am following your guide to run the tutorial code: https://pku-ahs.github.io/tutorial/en/master/steps.html#install

I have pulled and installed your Docker image in WSL, and I have completed these steps:

cd FlexTensor-Micro
export PYTHONPATH=$PYTHONPATH:/path/to/FlexTensor-Micro
cd FlexTensor-Micro/flextensor/tutorial

# First, CPU experiments
cd conv2d_llvm

# run flextensor
python optimize_conv2d.py --shapes res --target llvm --parallel 8 --timeout 20 --log resnet_config.log

But I am stuck on this error:

root@4f3f7856e4d4:/FlexTensor-Micro/flextensor/tutorial/conv2d_llvm# python3 optimize_conv2d.py --shapes res --target llvm --parallel 8 --timeout 20 --log resnet_config.log
Traceback (most recent call last):

  File "optimize_conv2d.py", line 9, in <module>
    from flextensor.utils import Config, RpcInfo

ModuleNotFoundError: No module named 'flextensor'

Could you kindly help me? Thank you!!!!
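(A common cause of this particular ModuleNotFoundError is that PYTHONPATH ends up pointing at the flextensor package directory itself, or was exported in a different shell, rather than at the repository root that contains it. A quick way to check and work around it from inside Python, assuming the /FlexTensor-Micro path visible in the shell prompt above:)

import sys

# The importable root is the directory that *contains* the `flextensor/`
# package, i.e. /FlexTensor-Micro in the Docker image used above.
sys.path.insert(0, "/FlexTensor-Micro")

import flextensor
print(flextensor.__file__)  # should point into /FlexTensor-Micro/flextensor/

from flextensor.utils import Config, RpcInfo  # the import that failed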

No valid solution found using the tutorial example

Source Code:

import tvm

def gemm(A, B):
    k = tvm.reduce_axis((0, B.shape[0]))
    return tvm.compute((A.shape[0], B.shape[1]), lambda i, j: tvm.sum(A[i, k] * B[k, j], axis=k))

def wrap_gemm(N, K, M):
    A = tvm.placeholder((N, K))
    B = tvm.placeholder((K, M))
    Output = gemm(A, B)
    return [Output.op], [A, B, Output]

from flextensor.task import register_task, Task
from flextensor.model import WalkerGroup
from flextensor.scheduler import schedule

if __name__ == '__main__':
  task = Task(
    "gemm",
    "gemm",
    wrap_gemm,
    (1024, 1024, 1024),
    "llvm",
    0)
  register_task(task)
  s, bufs, configs = schedule(
            task.key, # give the key of target task
            slevel=4,
            rlevel=3,
            op_trial=100,
            timeout=10,
            op_stop=30,
            method="searching",
            parallel=4,
            )

Output:

graph space size 1
op 0 space size: 43188288
[Warning] Directory lib is not empty, but reusing it
warm up [1584960963.652893] [ inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf ]
warm up [1584960977.754150] [ inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf ]
warm up [1584960991.611287] [ inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf ]
warm up [1584961005.486223] [ inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf ]
warm up [1584961019.563355] [ inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf ]
warm up [1584961033.576784] [ inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf ]
warm up [1584961047.659202] [ inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf ]
warm up [1584961061.669718] [ inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf ]
warm up [1584961075.572861] [ inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf ]
warm up [1584961089.390276] [ inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf ]
warm up [1584961103.236112] [ inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf ]
warm up [1584961117.260888] [ inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf ]
warm up [1584961131.503041] [ inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf ]
warm up [1584961145.477987] [ inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf ]
warm up [1584961159.549252] [ inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf ]
warm up [1584961173.469501] [ inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf ]
warm up [1584961187.413991] [ inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf ]
warm up [1584961201.389768] [ inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf ]
warm up [1584961215.701637] [ inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf ]
warm up [1584961230.179169] [ inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf ]
Warning: No valid schedule found in warm up process, please use more trials
Now automatically use more trials, increase 20
warm up [1584961244.064868] [ inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf ]
Warning: No valid schedule found in warm up process, please use more trials
Now automatically use more trials, increase 20
warm up [1584961257.944109] [ inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf ]
Warning: No valid schedule found in warm up process, please use more trials
Now automatically use more trials, increase 20
warm up [1584961271.798598] [ inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf ]
Warning: No valid schedule found in warm up process, please use more trials
Now automatically use more trials, increase 20
warm up [1584961285.662753] [ inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf ]
Warning: No valid schedule found in warm up process, please use more trials
...
...
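(Side note for anyone reproducing this on a recent TVM build: the top-level tvm.placeholder / tvm.compute / tvm.reduce_axis aliases used above were removed in later TVM releases, where these constructors live under tvm.te. An equivalent kernel definition, assuming TVM 0.8 or newer:)

import tvm
from tvm import te

def gemm(A, B):
    # Same GEMM as above, written against the tvm.te namespace.
    k = te.reduce_axis((0, B.shape[0]), name="k")
    return te.compute((A.shape[0], B.shape[1]),
                      lambda i, j: te.sum(A[i, k] * B[k, j], axis=k))

def wrap_gemm(N, K, M):
    A = te.placeholder((N, K), name="A")
    B = te.placeholder((K, M), name="B")
    Output = gemm(A, B)
    return [Output.op], [A, B, Output]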

[CPU] TypeError: reduce() of empty sequence with no initial value

Hi,
I'm running optimize/optimize_conv2d.py with a custom convolution layer on a CPU, and I get the following error. I understand why this happens, but shouldn't there be a guard for when config['spatial'] is empty?

The layer parameters are (128, 128, 28, 28, 256, _, 3, 3, _, 1, 1, 1, 1), and optimize_conv2d.py uses default args.

warm up [1606782542.740292] [ inf ]
Warning: No valid schedule found in warm up process, please use more trials
Now automatically use more trials, increase 1
Fail to find valid schedule, too many errors

Traceback (most recent call last):

  File "optimize_conv2d.py", line 210, in <module>
    logfile=flog,

  File "optimize_conv2d.py", line 116, in optimize
    rpc_info=rpc_info,

  File "/homes/tharindu/FlexTensor/flextensor/scheduler.py", line 2135, in schedule
    s, bufs = schedule_with_config(task_key, configs, rewrite=rewrite)

  File "/homes/tharindu/FlexTensor/flextensor/scheduler.py", line 2155, in schedule_with_config
    s, bufs = schedule_with_config_ops(ops, bufs, configs, op_pos=op_pos, target=task.target)

  File "/homes/tharindu/FlexTensor/flextensor/scheduler.py", line 2202, in schedule_with_config_ops
    template(s, op, op_states[i])

  File "/homes/tharindu/FlexTensor/flextensor/scheduler.py", line 1680, in _cpu_schedule_simple
    tmp_extent = reduce(lambda a, b: a * b, [x[count] for x in config["spatial"]])

TypeError: reduce() of empty sequence with no initial value
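(The guard asked about above could be as small as giving reduce() an initial value, so an empty config["spatial"] yields an extent of 1 instead of raising. A self-contained sketch of that behaviour; whether to apply it to the reduce call at scheduler.py line 1680 is of course for the maintainers to decide:)

from functools import reduce

def product_or_one(factors):
    # The trailing 1 is reduce()'s initial value, so an empty factor list
    # returns 1 instead of raising "reduce() of empty sequence ...".
    return reduce(lambda a, b: a * b, factors, 1)

print(product_or_one([4, 2, 8]))  # 64
print(product_or_one([]))         # 1, no TypeError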

Thanks,

'test_graph_schedule_cpu_general_dx' cannot be imported

Hello,

I got the following error when trying to run some examples, like flextensor/examples/opt_gemm_cpu.py.

Traceback (most recent call last):

  File "opt_gemm_cpu.py", line 6, in <module>
    from flextensor.test import test_graph_schedule_cpu_general_dx

ImportError: cannot import name 'test_graph_schedule_cpu_general_dx'

Is there any solution for that?

Thanks!
