
rocmarchive / realcaffe2


The repo is obsolete. Use at your own risk.

Home Page: https://github.com/pytorch/pytorch

License: Apache License 2.0

Shell 0.46% CMake 1.44% Makefile 0.01% C++ 55.04% Python 28.09% C 5.47% Cuda 6.39% Metal 0.38% Objective-C++ 2.33% Objective-C 0.06% CSS 0.02% HTML 0.05% Batchfile 0.05% Dockerfile 0.22%
amd amdgpu caffe2 hpc rocm

realcaffe2's People

Contributors

aazzolini, ashishfarmer, bddppq, benzyx, boryiingsu, bwasti, chocjy, enosair, harouwu, heslami, ilia-cher, jackielxu, jerryzh168, jhcross, kennyhorror, kittipatv, lukeyeager, petrex, pietern, rohithkrn, salexspb, sf-wind, slayton58, sunnieshang, urikz, viswanathgs, volkhin, wutao27, xianjiec, yangqing


realcaffe2's Issues

Exposing MIOPEN to python scripts

Is there a way to expose the availability of MIOPEN to Python scripts? In places where CUDNN is specified as the engine for an operator, checking the "has_hip" condition to select MIOPEN as the engine doesn't seem appropriate, since it misses the case where HIP is not present but MIOPEN is. @petrex @ashishfarmer
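
One way to frame the request: the engine choice would depend on whether the MIOPEN library itself is available rather than on whether HIP is. A minimal sketch, assuming the build could expose two independent flags; `has_cudnn` and `has_miopen` are hypothetical parameters used for illustration, not existing caffe2 attributes:

```python
def pick_engine(has_cudnn, has_miopen):
    """Return the engine string for an operator, or '' for the default engine.

    Both flags are hypothetical: they stand for whatever availability
    information the build would expose to Python.
    """
    if has_cudnn:
        return "CUDNN"
    if has_miopen:
        return "MIOPEN"
    return ""


# MIOPEN is available even though this check never looks at HIP.
print(pick_engine(has_cudnn=False, has_miopen=True))
```

With this shape, a test that currently hard-codes engine="CUDNN" could ask for the engine once and pass the result through, whichever backend is present.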

caffe2 utility binary loading shared libraries libcaffe2_hip.so

@ashishfarmer
The HIP binary tests need libcaffe2_hip.so.
What about the other utility binaries? Is it necessary for them as well? Please share your insight. Thanks.

pyeh@rocm-miopen /usr/local/bin $ ./make_cifar_db                                                                                                               
./make_cifar_db: error while loading shared libraries: ../../lib/libcaffe2_hip.so: cannot open shared object file: No such file or directory
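
The error message shows a relative path, ../../lib/libcaffe2_hip.so. If the loader resolves that path against the current working directory (an assumption; the issue does not say how the path was recorded in the binary), the lookup target depends on where the command is run. A small sketch of that resolution, with `resolve_recorded_path` as a hypothetical helper name:

```python
import posixpath


def resolve_recorded_path(cwd, recorded):
    """Resolve a relative library path against a working directory.

    This only mimics a cwd-relative lookup; it does not inspect the binary.
    """
    return posixpath.normpath(posixpath.join(cwd, recorded))


recorded = "../../lib/libcaffe2_hip.so"  # path taken from the error message
print(resolve_recorded_path("/usr/local/bin", recorded))
```

Under that assumption, running from /usr/local/bin makes the lookup land in /usr/lib rather than /usr/local/lib, which would produce this error if the library was installed under /usr/local/lib.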

Operator tests with cub/thrust dependency

Create a list of cub-dependent tests (deferred for now).
We will need to revisit this when rocPRIM is ready.

  • caffe2/operators/elementwise_op_hip.cc
  • caffe2/operators/accuracy_op_hip.cc

HIP FP32 intrinsics

Identify the HIP equivalents of the FP32 intrinsics __fmaf_rn, __fmul_rn, and __fsub_rn, and fix the HIP ops.

BatchMatMul is failing

Errors for float16 type.
GPU error

ERROR: test_batch_matmul (__main__.TestBatchMatMul)

Traceback (most recent call last):
File "../caffe2/python/operator_test/matmul_op_test.py", line 145, in test_batch_matmul
@given(
File "/usr/local/lib/python2.7/dist-packages/hypothesis/core.py", line 1049, in wrapped_test
state.run()
File "/usr/local/lib/python2.7/dist-packages/hypothesis/core.py", line 820, in run
falsifying_example.__expected_traceback,
File "/usr/local/lib/python2.7/dist-packages/hypothesis/core.py", line 581, in execute
result = self.test_runner(data, run)
File "/usr/local/lib/python2.7/dist-packages/hypothesis/executors.py", line 58, in default_new_style_executor
return function(data)
File "/usr/local/lib/python2.7/dist-packages/hypothesis/core.py", line 573, in run
return test(*args, **kwargs)
File "../caffe2/python/operator_test/matmul_op_test.py", line 145, in test_batch_matmul
@given(
File "/usr/local/lib/python2.7/dist-packages/hypothesis/core.py", line 520, in test
result = self.test(*args, **kwargs)
File "../caffe2/python/operator_test/matmul_op_test.py", line 195, in test_batch_matmul
relax_fp16_check(self.assertReferenceChecks, gc, op, [X, Y, trans_a, trans_b, dtype], matmul_ref)
File "../caffe2/python/operator_test/matmul_op_test.py", line 192, in relax_fp16_check
check_func(*args, threshold=threshold, **kwargs)
File "/data/rocm_caffe2/build/caffe2/python/hypothesis_test_util.py", line 574, in assertReferenceChecks
workspace.RunNetOnce(net)
File "/data/rocm_caffe2/build/caffe2/python/workspace.py", line 216, in RunNetOnce
StringifyProto(net),
File "/data/rocm_caffe2/build/caffe2/python/workspace.py", line 199, in CallWithExceptionIntercept
return func(*args, **kwargs)
RuntimeError: [enforce fail at math_hip.cc:400] . Unsupported math type Error from operator:
input: "X" input: "Y" output: "out" name: "" type: "BatchMatMul" arg { name: "trans_a" i: 0 } arg { name: "trans_b" i: 0 } device_option { device_type: 4 }

CPU error

ERROR: test_batch_matmul (__main__.TestBatchMatMul)

Traceback (most recent call last):
File "../caffe2/python/operator_test/matmul_op_test.py", line 145, in test_batch_matmul
@given(
File "/usr/local/lib/python2.7/dist-packages/hypothesis/core.py", line 1049, in wrapped_test
state.run()
File "/usr/local/lib/python2.7/dist-packages/hypothesis/core.py", line 820, in run
falsifying_example.__expected_traceback,
File "/usr/local/lib/python2.7/dist-packages/hypothesis/core.py", line 581, in execute
result = self.test_runner(data, run)
File "/usr/local/lib/python2.7/dist-packages/hypothesis/executors.py", line 58, in default_new_style_executor
return function(data)
File "/usr/local/lib/python2.7/dist-packages/hypothesis/core.py", line 573, in run
return test(*args, **kwargs)
File "../caffe2/python/operator_test/matmul_op_test.py", line 145, in test_batch_matmul
@given(
File "/usr/local/lib/python2.7/dist-packages/hypothesis/core.py", line 520, in test
result = self.test(*args, **kwargs)
File "../caffe2/python/operator_test/matmul_op_test.py", line 195, in test_batch_matmul
relax_fp16_check(self.assertReferenceChecks, gc, op, [X, Y, trans_a, trans_b, dtype], matmul_ref)
File "../caffe2/python/operator_test/matmul_op_test.py", line 192, in relax_fp16_check
check_func(*args, threshold=threshold, **kwargs)
File "/data/rocm_caffe2/build/caffe2/python/hypothesis_test_util.py", line 574, in assertReferenceChecks
workspace.RunNetOnce(net)
File "/data/rocm_caffe2/build/caffe2/python/workspace.py", line 216, in RunNetOnce
StringifyProto(net),
File "/data/rocm_caffe2/build/caffe2/python/workspace.py", line 199, in CallWithExceptionIntercept
return func(*args, **kwargs)
RuntimeError: [enforce fail at operator.h:640] . Unsupported type of tensor: caffe2::__f16 Error from operator:
input: "X" input: "Y" output: "out" name: "" type: "BatchMatMul" arg { name: "trans_a" i: 0 } arg { name: "trans_b" i: 0 } device_option { }

For float32:
The CPU test passes, but the GPU test hangs during the gradient check.

pooling test fail when using MIOPEN as engine

The MIOPEN pooling test fails.

Trying example: test_global_avg_pool_nchw(self=<__main__.TestPooling testMethod=test_global_avg_pool_nchw>, op_type='AveragePool2D', sz=2, batch_size=2, engine='MIOPEN', gc=device_type: 4, dc=[, device_type: 4])
*** Aborted at 1521584246 (unix time) try "date -d @1521584246" if you are using GNU date ***
PC: @     0x7fdf717181b1 miopen::PoolingDescriptor::GetForwardOutputDim()
*** SIGSEGV (@0x0) received by PID 62799 (TID 0x7fdfa3e49700) from PID 0; stack trace: ***
    @     0x7fdfa3a39390 (unknown)
    @     0x7fdf717181b1 miopen::PoolingDescriptor::GetForwardOutputDim()
    @     0x7fdf7158e838 miopenGetPoolingForwardOutputDim
    @     0x7fdf95e7e807 caffe2::MIOPENPoolOp::DoRunWithType<>()
    @     0x7fdf95e71f56 caffe2::MIOPENPoolOp::RunOnDevice()
    @     0x7fdf94a661fe caffe2::Operator<>::Run()
    @     0x7fdf76cf6732 caffe2::Workspace::RunOperatorOnce()
    @     0x7fdf9754ab4a _ZZN6caffe26python16addGlobalMethodsERN8pybind116moduleEENKUlRKNS1_5bytesEE26_clES6_.isra.2495.constprop.2527
    @     0x7fdf9754ade5 _ZZN8pybind1112cpp_function10initializeIZN6caffe26python16addGlobalMethodsERNS_6moduleEEUlRKNS_5bytesEE26_bJS8_EJNS_4nameENS_5scopeENS_7siblingEEEEvOT_PFT0_DpT1_EDpRKT2_ENUlRNS_6detail13function_callEE1_4_FUNESQ_
    @     0x7fdf9756ad43 pybind11::cpp_function::dispatcher()
    @           0x4c1d50 PyEval_EvalFrameEx
    @           0x4b9ab6 PyEval_EvalCodeEx
    @           0x4c1e6f PyEval_EvalFrameEx
    @           0x4b9ab6 PyEval_EvalCodeEx
    @           0x4c16e7 PyEval_EvalFrameEx
    @           0x4b9ab6 PyEval_EvalCodeEx
    @           0x4c16e7 PyEval_EvalFrameEx
    @           0x4b9ab6 PyEval_EvalCodeEx
    @           0x4d55f3 (unknown)
    @           0x4a577e PyObject_Call
    @           0x4bed3d PyEval_EvalFrameEx
    @           0x4b9ab6 PyEval_EvalCodeEx
    @           0x4c1e6f PyEval_EvalFrameEx
    @           0x4b9ab6 PyEval_EvalCodeEx
    @           0x4d55f3 (unknown)
    @           0x4a577e PyObject_Call
    @           0x4bed3d PyEval_EvalFrameEx
    @           0x4b9ab6 PyEval_EvalCodeEx
    @           0x4c1e6f PyEval_EvalFrameEx
    @           0x4b9ab6 PyEval_EvalCodeEx
    @           0x4c1e6f PyEval_EvalFrameEx
    @           0x4b9ab6 PyEval_EvalCodeEx
^C[1]    62799 segmentation fault (core dumped)  python -m caffe2.python.operator_test.pooling_test


Add HIP device support in the python based operator tests

We need to add HIP device support in the following 3 op tests.

==================================== ERRORS ====================================
_____________ ERROR collecting python/data_parallel_model_test.py ______________
python/data_parallel_model_test.py:516: in <module>
@unittest.skipIf(workspace.NumCudaDevices() < 2, "Need at least 2 GPUs.")

E AttributeError: 'module' object has no attribute 'NumCudaDevices'
------------------------------- Captured stdout --------------------------------
Ignoring @/caffe2/caffe2/contrib/nccl:nccl_ops as it is not a valid file.
Ignoring @/caffe2/caffe2/contrib/gloo:gloo_ops as it is not a valid file.
Ignoring @/caffe2/caffe2/contrib/gloo:gloo_ops_gpu as it is not a valid file.
Ignoring @/caffe2/caffe2/distributed:file_store_handler_ops as it is not a valid file.
________________ ERROR collecting python/gradient_check_test.py ________________
python/gradient_check_test.py:41: in <module>
if workspace.has_gpu_support and workspace.NumCudaDevices() > 0:

E AttributeError: 'module' object has no attribute 'NumCudaDevices'
___________ ERROR collecting python/operator_test/load_save_test.py ____________
python/operator_test/load_save_test.py:34: in <module>
max_gpuid = workspace.NumCudaDevices() - 1
E AttributeError: 'module' object has no attribute 'NumCudaDevices'
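
All three collection errors come from calling workspace.NumCudaDevices() on a build where that attribute does not exist. A minimal sketch of a guard the tests could use, assuming a HIP build exposes some device-count function; the name `NumHipDevices` and the helper `num_gpu_devices` are hypothetical, and the `SimpleNamespace` objects merely stand in for the real workspace module:

```python
from types import SimpleNamespace


def num_gpu_devices(ws):
    """Count GPUs using whichever device-count function the module provides."""
    # "NumHipDevices" is a hypothetical name for a HIP counterpart.
    for name in ("NumCudaDevices", "NumHipDevices"):
        fn = getattr(ws, name, None)
        if callable(fn):
            return fn()
    return 0  # neither function exists: treat the build as CPU-only


# Stand-ins for the workspace module, for illustration only.
hip_build = SimpleNamespace(NumHipDevices=lambda: 2)
cpu_build = SimpleNamespace()
print(num_gpu_devices(hip_build), num_gpu_devices(cpu_build))
```

A decorator such as unittest.skipIf could then take num_gpu_devices(workspace) < 2 as its condition, so the module imports cleanly on either kind of build.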

Fix the HIP assert issues

Identify the assert functionality in HIP and its roadmap. Fix the HIP operators that use assert(...).

Running MIOpen BN throws warning message

Running SpatialBatchNorm op in MIOpen path generates a warning on the command line:

E0328 18:07:28.225716 127492 spatial_batch_norm_op_miopen.cc:38] Provided epsilon is smaller than MIOPEN_BN_MIN_EPSILON. Setting it to MIOPEN_BN_MIN_EPSILON instead.

This happens because Caffe2 sets epsilon_ to 1e-5, which is the same as MIOPEN_BN_MIN_EPSILON, and that triggers the condition on line 38 of spatial_batch_norm_op_miopen.cc.
The CUDNN path uses the condition if (epsilon_ <= CUDNN_BN_MIN_EPSILON - FLT_EPSILON)
instead of if (epsilon_ <= MIOPEN_BN_MIN_EPSILON).

We need to investigate the proper value of MIOPEN_BN_MIN_EPSILON and the proper condition.
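
The difference between the two conditions can be checked with plain arithmetic. A minimal sketch, taking the values from the description above (epsilon_ of 1e-5 and a minimum of 1e-5) and the standard 32-bit FLT_EPSILON; the function names are hypothetical:

```python
FLT_EPSILON = 1.1920929e-07  # C's FLT_EPSILON for 32-bit floats
BN_MIN_EPSILON = 1e-5        # value the issue attributes to MIOPEN_BN_MIN_EPSILON


def clamps_miopen_style(epsilon):
    """Condition quoted for the MIOPEN path: epsilon_ <= MIN."""
    return epsilon <= BN_MIN_EPSILON


def clamps_cudnn_style(epsilon):
    """Condition quoted for the CUDNN path: epsilon_ <= MIN - FLT_EPSILON."""
    return epsilon <= BN_MIN_EPSILON - FLT_EPSILON


eps = 1e-5  # Caffe2's epsilon_ according to the issue
print(clamps_miopen_style(eps), clamps_cudnn_style(eps))
```

With epsilon equal to the minimum, the first condition fires and the second does not, which matches the warning appearing only on the MIOPEN path.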

miopen ops looking for reference in libcaffe2.so

Not sure if this is an environment issue; we need to investigate why this is happening (the reference should resolve to libcaffe2_hip.so instead).

WARNING:root:This caffe2 python run does not have GPU support. Will run in CPU only mode.
WARNING:root:Debug message: No module named caffe2_pybind11_state_gpu
CRITICAL:root:Cannot load caffe2.python. Error: /usr/local/lib/libcaffe2.so: undefined symbol: hipFree

rocm-caffe2 distributed mode

Docker image: docker.io/rocm/caffe2:rocm1.7-miopen-dev-v1

On a single node, the following command works: "python ../caffe2/python/examples/resnet50_trainer.py --train_data=/work/DL/dataset/cifar10/cifar10_train_lmdb/ --num_shards=1 --shard_id=0 --run_id=10 --file_store_path=/work/caffe2/". The work directory is shared over NFS.
But when I run on two nodes, I get errors.
node1: python ../caffe2/python/examples/resnet50_trainer.py --train_data=/work/DL/dataset/cifar10/cifar10_train_lmdb/ --num_shards=2 --shard_id=0 --run_id=10 --file_store_path=/work/caffe2/
node2: python ../caffe2/python/examples/resnet50_trainer.py --train_data=/work/DL/dataset/cifar10/cifar10_train_lmdb/ --num_shards=2 --shard_id=1 --run_id=10 --file_store_path=/work/caffe2/

The outputs are as follows:
RuntimeError: [enforce fail at no_default_engine_op.h:45] . The operator CreateCommonWorld does not have a default engine implementation. Please specify an engine explicitly for this operator. Error from operator:
input: "store_handler" output: "allreduce_0_cw" name: "allreduce_0_cw_op" type: "CreateCommonWorld" arg { name: "timeout_ms" i: 30000 } arg { name: "rank" i: 0 } arg { name: "interface" s: "" } arg { name: "status_blob" s: "create_allreduce_cw_0_status" } arg { name: "transport" s: "tcp" } arg { name: "size" i: 2 } device_option { device_type: 4 hip_gpu_id: 0 } engine: "GLOO"

Thank you.

Long running time for python base op test

I am running the relu_op test with MIOPEN as the engine. The test passes, but the running time varies from run to run,
ranging roughly from 0.6 sec to 3.9 sec.

We need to investigate whether this is an issue in the implementation or an opportunity to improve the test infrastructure.
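
To tell a one-off slow run from a consistent spread, the same call can be timed several times and summarized. A minimal sketch using only the standard library; `time_repeated` is a hypothetical helper, and the lambda is a placeholder workload standing in for the actual op test:

```python
import statistics
import time


def time_repeated(fn, repeats=10):
    """Run fn several times and summarize the wall-clock durations in seconds."""
    durations = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        durations.append(time.perf_counter() - start)
    return {
        "min": min(durations),
        "median": statistics.median(durations),
        "max": max(durations),
    }


# Placeholder workload; replace with the call that runs the op under test.
stats = time_repeated(lambda: sum(range(100000)))
print(stats)
```

If only the first repetition is slow, one-time setup cost is a likely suspect; a wide spread across all repetitions points elsewhere.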
