
ark's Introduction

ARK

A GPU-driven system framework for scalable AI applications.


NOTE (Nov 2023): ROCm unit tests will be migrated to an Azure pipeline in the future.

See the Quick Start guide to get started.

Overview

ARK is a deep learning framework designed for highly optimized performance across distributed GPUs. Specifically, ARK adopts a GPU-driven execution model, in which the GPU autonomously schedules and executes both computation and communication without any CPU intervention.

ARK provides a set of APIs for users to express their distributed deep learning applications. ARK then automatically schedules a GPU-driven execution plan for the application and generates GPU kernel code called a loop kernel. The loop kernel contains a loop that iteratively executes the entire application, including both computation and communication. ARK then executes this loop kernel across the distributed GPUs.
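The execution model above can be sketched in plain Python (no real ARK APIs are used here; all names are illustrative): a loop kernel is a single persistent loop that walks the fused list of operators every iteration, rather than launching one kernel per operator from the CPU.

```python
# Conceptual sketch of a "loop kernel": one persistent loop that runs the
# application's operator list (computation and communication) each iteration,
# with no CPU intervention between steps. All names are illustrative.

def make_loop_kernel(ops):
    """Fuse a list of operator callables into a single persistent loop."""
    def loop_kernel(state, iterations):
        for _ in range(iterations):
            for op in ops:          # computation and communication alike
                state = op(state)
        return state
    return loop_kernel

# Toy "operators": a compute step and a (simulated) communication step.
compute = lambda x: x * 2
communicate = lambda x: x + 1      # stands in for a send/recv pair

kernel = make_loop_kernel([compute, communicate])
print(kernel(1, 3))  # 15
```

On a real GPU the loop body is generated device code and the loop is a persistent kernel; this sketch only mirrors the control flow.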

GPU-driven System Architecture

Status & Roadmap

ARK is under active development, and some of its features will be added in future releases. The following summarizes the key features of each version.

New in ARK v0.5 (Latest Release)

  • Integrated with MSCCL++
  • Removed dependency on gpudma
  • Added AMD CDNA3 architecture support
  • Added communication support for AMD GPUs
  • Optimized OpGraph scheduling
  • Added a multi-GPU Llama2 example

See #168 for details.

ARK v0.6 (TBU, Jan. 2024)

  • Overall performance optimization
  • Improve Python unit tests & code coverage

Contributing

This project welcomes contributions and suggestions. Most contributions require you to agree to a Contributor License Agreement (CLA) declaring that you have the right to, and actually do, grant us the rights to use your contribution. For details, visit https://cla.opensource.microsoft.com.

When you submit a pull request, a CLA bot will automatically determine whether you need to provide a CLA and decorate the PR appropriately (e.g., status check, comment). Simply follow the instructions provided by the bot. You will only need to do this once across all repos using our CLA.

This project has adopted the Microsoft Open Source Code of Conduct. For more information see the Code of Conduct FAQ or contact [email protected] with any additional questions or comments.

Trademarks

This project may contain trademarks or logos for projects, products, or services. Authorized use of Microsoft trademarks or logos is subject to and must follow Microsoft's Trademark & Brand Guidelines. Use of Microsoft trademarks or logos in modified versions of this project must not cause confusion or imply Microsoft sponsorship. Any use of third-party trademarks or logos are subject to those third-party's policies.

Citations


ARK is a collaborative research initiative between KAIST and Microsoft Research. If you use this project in your research, please cite our NSDI'23 paper:

@inproceedings{HwangPSQCX23,
  author    = {Changho Hwang and
               KyoungSoo Park and
               Ran Shu and
               Xinyuan Qu and
               Peng Cheng and
               Yongqiang Xiong},
  title     = {ARK: GPU-driven Code Execution for Distributed Deep Learning},
  booktitle = {20th {USENIX} Symposium on Networked Systems Design and Implementation ({NSDI} 23)},
  year      = {2023},
  publisher = {{USENIX} Association},
}

ark's People

Contributors

binyang2014, chhwang, microsoftopensource, wusar


ark's Issues

[Bug] CK GeMM correctness bug

Describe the bug
test_matmul_fp32 and test_matmul_fp16_split fail on MI300x.

To Reproduce
Run the unit test.

Expected behavior
max_diff should be lower than the calculated value.
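For reference, a max_diff metric in this kind of test is typically the largest elementwise deviation of the low-precision result from a higher-precision reference, compared against a calculated error bound. A self-contained sketch (plain Python, not ARK's actual test harness; the rounding step merely simulates reduced precision):

```python
# Sketch of a max_diff tolerance check for a reduced-precision matmul.
# "Low precision" is simulated by rounding; this is not ARK's real harness.

def matmul(a, b):
    n, k, m = len(a), len(b), len(b[0])
    return [[sum(a[i][p] * b[p][j] for p in range(k)) for j in range(m)]
            for i in range(n)]

def max_diff(x, y):
    """Largest elementwise absolute difference between two matrices."""
    return max(abs(xi - yi) for rx, ry in zip(x, y) for xi, yi in zip(rx, ry))

a = [[0.1, 0.2], [0.3, 0.4]]
b = [[0.5, 0.6], [0.7, 0.8]]
ref = matmul(a, b)                                  # high-precision reference
low = [[round(v, 2) for v in row] for row in ref]   # simulated low precision
tolerance = 0.01                                    # calculated error bound
assert max_diff(ref, low) <= tolerance              # the test's pass condition
```

A failure like the one reported means the observed max_diff exceeded the calculated bound on MI300x.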

System (please complete the following information):

  • MI300x
  • Single-GPU


transformer_test.py prompts [AttributeError: 'Model' object has no attribute 'tensor']

ENV: 2 A100s, running the provided images for ARK versions 0.3.0 through 0.5.0.
BUG: Running python3 transformer_test.py in examples/transformer raises AttributeError: 'Model' object has no attribute 'tensor'.
e.g. ARK 0.5.0:
Traceback (most recent call last):
  File "examples/transformer/transformer_test.py", line 560, in <module>
    test_PoswiseFeedForwardNet()
  File "examples/transformer/transformer_test.py", line 15, in test_PoswiseFeedForwardNet
    input_tensor = model.tensor(
AttributeError: 'Model' object has no attribute 'tensor'
Please tell us how to solve this problem, thank you.

ARK v0.4.0 Release Plan (Released)

Timeline

Release Date: Nov. 14th, 2023

Work Items (TBU)

Platforms Support

    • ROCm: add ROCm backend support (#162)

Operators Support

    • [ ] Operator: add int8 operators (dropping this plan for now)
    • Operator: add more AllReduce & AllGather algorithms (#152)

Examples

    • [ ] Example: add Llama2 multi-GPU examples (moved to the next version release)

CI

    • [ ] Unit Tests: revise Python unit tests & add to the Azure pipeline (moved to the next version release)
    • [ ] Code Coverage: add code coverage for Python code (moved to the next version release)

Bug Fix

Action required: migrate or opt-out of migration to GitHub inside Microsoft

Migrate non-Open Source or non-External Collaboration repositories to GitHub inside Microsoft

In order to protect and secure Microsoft, private or internal repositories in GitHub for Open Source which are not related to open source projects or require collaboration with 3rd parties (customer, partners, etc.) must be migrated to GitHub inside Microsoft a.k.a GitHub Enterprise Cloud with Enterprise Managed User (GHEC EMU).

Action

✍️ Please RSVP to opt-in or opt-out of the migration to GitHub inside Microsoft.

❗Only users with admin permission in the repository are allowed to respond. Failure to provide a response will result in your repository being automatically archived.🔒

Instructions

Reply with a comment on this issue containing one of the optin or optout commands below.

✅ Opt-in to migrate

@gimsvc optin --date <target_migration_date in mm-dd-yyyy format>

Example: @gimsvc optin --date 03-15-2023

OR

❌ Opt-out of migration

@gimsvc optout --reason <staging|collaboration|delete|other>

Example: @gimsvc optout --reason staging

Options:

  • staging : This repository will ship as Open Source or go public
  • collaboration : Used for external or 3rd party collaboration with customers, partners, suppliers, etc.
  • delete : This repository will be deleted because it is no longer needed.
  • other : Other reasons not specified

Need more help? 🖐️

[Bug] ARK 0.4.1: multi_gpu_tutorial.py run error

Describe the bug
ARK 0.4.1: running multi_gpu_tutorial.py fails in sched_default.cc (line 393, in configure_gpu_buf) and tensor.cc (line 246, in update_pads) with the following error:
invalid padding detected. This is likely caused because one GPU buffer is used by multiple operators that require different padding. A possible workaround is to let each operator use a different buffer by creating a new tensor rather than overwriting an existing tensor. op name: send

To Reproduce
Run multi_gpu_tutorial.py in ARK 0.4.1.

Expected behavior

  1. An explanation of why this error occurs.
  2. What relationship must ldims, type_bytes, and tile of ref_tensor and this_tensor satisfy in update_pads?

System (please complete the following information):

  • ark0.4.1
  • OS: [e.g. Ubuntu18.04]
  • GPU [A100]
  • Networking Environment [Single-node, Multi-gpu]


ARK v0.3.0 New Operators

  • [ ] int8 type support in many operators (#72) moved to the next version plan
  • bfloat16 type support in many operators (#142)
  • embedding: add support (#122)
  • cast: add support (#127)

ARK v0.2.0 Release Plan (Released)

Timeline

Release Date: Sep. 5th, 2023

Work Items

Model

    • Interface: expose the underlying buffer info to Tensor (#79)

Scheduler

    • [ ] Graph Optimization: enable this feature moved to the next version
    • [ ] SimpleScheduler: fix broken features moved to the next version

Communication Stack

    • Interface: hide GpuCommSw implementation from the interface (#81)
    • Interface: extend the current interface (#104)

Operators Support

    • Operator: add more operators (#62)
    • Operator: upgrade CUTLASS (#105)

Python

    • Interface: #96

Examples

    • [ ] Example: add Llama2 example (#102) moved to the next version
    • Example: parallel matmul example (#64)

Bug Fix

Documents

CI

    • Code Coverage: add code coverage (#110)
    • Unit Tests: add a unit test pipeline (#88)
    • Unit Tests: #91

build error

Describe the bug
When I build the source code following the install document on Ubuntu 20.04, the following errors occur:

In file included from /root/ark/third_party/mscclpp/src/include/atomic.hpp:9,
                 from /root/ark/third_party/mscclpp/src/fifo.cc:8:
/root/ark/third_party/mscclpp/include/mscclpp/atomic_device.hpp: In function ‘T mscclpp::atomicLoad(T*, cuda::__3::memory_order)’:
/root/ark/third_party/mscclpp/include/mscclpp/atomic_device.hpp:25:16: error: ‘atomic_ref’ is not a member of ‘cuda’
   25 |   return cuda::atomic_ref<T, cuda::thread_scope_system>{*ptr}.load(memoryOrder);
      |                ^~~~~~~~~~
/root/ark/third_party/mscclpp/include/mscclpp/atomic_device.hpp:25:28: error: expected primary-expression before ‘,’ token
   25 |   return cuda::atomic_ref<T, cuda::thread_scope_system>{*ptr}.load(memoryOrder);
      |                            ^
/root/ark/third_party/mscclpp/include/mscclpp/atomic_device.hpp:25:56: error: expected primary-expression before ‘{’ token
   25 |   return cuda::atomic_ref<T, cuda::thread_scope_system>{*ptr}.load(memoryOrder);
      |                                                        ^
/root/ark/third_party/mscclpp/include/mscclpp/atomic_device.hpp:25:56: error: expected ‘;’ before ‘{’ token
   25 |   return cuda::atomic_ref<T, cuda::thread_scope_system>{*ptr}.load(memoryOrder);
      |                                                        ^
      |                                                        ;
/root/ark/third_party/mscclpp/include/mscclpp/atomic_device.hpp:25:61: error: expected ‘;’ before ‘}’ token
   25 |   return cuda::atomic_ref<T, cuda::thread_scope_system>{*ptr}.load(memoryOrder);
      |                                                             ^
      |                                                             ;
/root/ark/third_party/mscclpp/include/mscclpp/atomic_device.hpp:25:62: error: expected primary-expression before ‘.’ token
   25 |   return cuda::atomic_ref<T, cuda::thread_scope_system>{*ptr}.load(memoryOrder);
      |                                                              ^
/root/ark/third_party/mscclpp/include/mscclpp/atomic_device.hpp: In function ‘void mscclpp::atomicStore(T*, const T&, cuda::__3::memory_order)’:
/root/ark/third_party/mscclpp/include/mscclpp/atomic_device.hpp:30:9: error: ‘atomic_ref’ is not a member of ‘cuda’
   30 |   cuda::atomic_ref<T, cuda::thread_scope_system>{*ptr}.store(val, memoryOrder);
      |         ^~~~~~~~~~
/root/ark/third_party/mscclpp/include/mscclpp/atomic_device.hpp:30:21: error: expected primary-expression before ‘,’ token
   30 |   cuda::atomic_ref<T, cuda::thread_scope_system>{*ptr}.store(val, memoryOrder);
      |                     ^
/root/ark/third_party/mscclpp/include/mscclpp/atomic_device.hpp:30:49: error: expected primary-expression before ‘{’ token
   30 |   cuda::atomic_ref<T, cuda::thread_scope_system>{*ptr}.store(val, memoryOrder);
      |                                                 ^
/root/ark/third_party/mscclpp/include/mscclpp/atomic_device.hpp:30:55: error: expected primary-expression before ‘.’ token
   30 |   cuda::atomic_ref<T, cuda::thread_scope_system>{*ptr}.store(val, memoryOrder);
      |                                                       ^
/root/ark/third_party/mscclpp/include/mscclpp/atomic_device.hpp: In function ‘T mscclpp::atomicFetchAdd(T*, const T&, cuda::__3::memory_order)’:
/root/ark/third_party/mscclpp/include/mscclpp/atomic_device.hpp:35:16: error: ‘atomic_ref’ is not a member of ‘cuda’
   35 |   return cuda::atomic_ref<T, cuda::thread_scope_system>{*ptr}.fetch_add(val, memoryOrder);
      |                ^~~~~~~~~~
/root/ark/third_party/mscclpp/include/mscclpp/atomic_device.hpp:35:28: error: expected primary-expression before ‘,’ token
   35 |   return cuda::atomic_ref<T, cuda::thread_scope_system>{*ptr}.fetch_add(val, memoryOrder);
      |                            ^
/root/ark/third_party/mscclpp/include/mscclpp/atomic_device.hpp:35:56: error: expected primary-expression before ‘{’ token
   35 |   return cuda::atomic_ref<T, cuda::thread_scope_system>{*ptr}.fetch_add(val, memoryOrder);
      |                                                        ^
/root/ark/third_party/mscclpp/include/mscclpp/atomic_device.hpp:35:56: error: expected ‘;’ before ‘{’ token
   35 |   return cuda::atomic_ref<T, cuda::thread_scope_system>{*ptr}.fetch_add(val, memoryOrder);
      |                                                        ^
      |                                                        ;
/root/ark/third_party/mscclpp/include/mscclpp/atomic_device.hpp:35:61: error: expected ‘;’ before ‘}’ token
   35 |   return cuda::atomic_ref<T, cuda::thread_scope_system>{*ptr}.fetch_add(val, memoryOrder);
      |         

System:

  • OS: Ubuntu20.04
  • GPU Geforce 3060 (single)
  • Compiler: g++(9.4), cmake (3.28.1)
  • NVCC: 11.4

[Feature] Can you add a latency comparison between using ARK and not using it to the existing examples, e.g., the Llama demo?


ARK v0.5.0 Release Plan (Released)

Timeline

Release Date: Dec. 16th, 2023

Work Items

Major Improvement

    • MSCCL++: integrated with MSCCL++ and removed dependency on gpudma (#179)

Platforms Support

    • ROCm: add ROCm multi-GPU support (#181)

Operators

    • reduce: keepdims support for reduction (#173)

Optimization

    • OpGraph: optimize OpGraph scheduling (#182)

Examples

    • Example: add Llama2 multi-GPU examples (#170)

CI

    • [ ] Unit Tests: revise Python unit tests & add to the Azure pipeline (moved to the next version release plan)
    • [ ] Unit Tests: add ROCm Azure pipelines (moved to the next version release plan)
    • [ ] Code Coverage: add code coverage for Python code (moved to the next version release plan)

ARK v0.3.x Known Bugs & Issues

    • Support both source and destination offsets in NetIbQp::stage_send()
    • Offsets of importing/exporting tensors are not properly handled
    • Use Kahan sum for layernorm (#159)
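For context on the last item: Kahan (compensated) summation carries a running error term so that long floating-point reductions, like the mean/variance sums inside layernorm, lose far less precision than naive accumulation. A standard sketch of the algorithm:

```python
def kahan_sum(values):
    """Compensated summation: carries the rounding error of each addition."""
    total = 0.0
    c = 0.0                   # running compensation for lost low-order bits
    for v in values:
        y = v - c             # subtract the previously lost bits
        t = total + y         # big + small: low-order bits of y may be lost
        c = (t - total) - y   # algebraically 0; recovers what was lost
        total = t
    return total

# Summing a million tiny values: the compensated result stays near exact.
vals = [1e-8] * 10**6
print(abs(kahan_sum(vals) - 1e-2))  # close to zero
```

The compensation step `(t - total) - y` is exactly the part an optimizing compiler must not simplify away, which is why such code is usually built without fast-math flags.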

ARK v0.6.0 Release Plan

Timeline

Expected Release Date: Mar. 31st, 2024 (previously Jan. 16th, 2024)

Work Items (TBU)

Performance

    • Computation Kernel: improved vectorization (#189)
    • Communication Kernel: improve performance (#190)

CI

    • Unit Tests: revise Python unit tests & add to the Azure pipeline
    • Unit Tests: add ROCm Azure pipelines
    • Unit Tests: add ROCm 6.0 CI (#188)
    • Code Coverage: add code coverage for Python code

Bug Fix

    • CUTLASS: fix the patch file (#187)
    • OpGraph: a bug fix (#192)

ARK v0.1.0 Known Bugs & Issues

    • Tensor::is_sequential() may need more strict conditions. (#111)
    • Executor::tensor_memcpy_host_to_device() causes an unknown error if the host-side tensor is not sequential. We need more checks on the host-side tensor, or maybe a Python wrapper for this. (#48)
    • Sometimes, if the tensor is padded, the allgather operation might overwrite the recv tensor, making the allreduce result incorrect as well. (@chhwang: now send/recv checks contiguity)
    • The current layernorm and softmax operations are scheduled in a rather hacky way and may need further updates. (#59)
    • ark.init() is not working. (#39)
    • Layernorm needs a recv dependency at its output (@chhwang: it already has one)
    • ARK environments are not working for Python (#54)
    • [ ] Support both source and destination offsets in NetIbQp::stage_send() moved to the next version
    • Remove a misleading error message (#111):
      if (input->ndims() > 1) {
          LOG(INFO,
              "warning: if the send tensor if not contiguous, the all_gather "
              "may not work correctly");
      }
    • ops_matmul_test.cc is not checking error rates correctly (#91)
    • send_mm and recv_mm are temporarily broken (#52)
    • When using python -m unittest discover -s . -p "test_*.py" to run all unit tests, the sendrecv test fails, but when we run the tests separately there is no problem. It seems that in some cases the previous runtime context is not destroyed when one unit test finishes and another starts. This problem also exists in the current main branch. (@chhwang: this is the test code's issue, won't fix for now)
    • The matmul test fails for shapes larger than (128, 2048, 1024) (#54)
    • [ ] Offsets of importing/exporting tensors are not properly handled moved to the next version
    • The matmul unit test fails for test_matmul_transpose (#94)
    • The float matmul error rate seems too high, but it is unclear whether this is ARK's issue or the test code's issue (@chhwang: this is not an issue)

ARK v0.3.0 Release Plan (Released)

Timeline

Release Date: Oct. 4th, 2023

Work Items

Interface

    • Python APIs: revise interface
    • Communication: revise send/recv interfaces (#138)

Scheduler

    • Feature: enable heuristic graph optimization (#136)
    • Feature: support mixed precision (#134)
    • [ ] SimpleScheduler: fix broken features plan changed -- deprecate SimpleScheduler

Communication Stack

    • Interface: make send/recv interface simpler (#138)

Operators Support

    • Operator: add more operators (#107)

Examples

    • Example: add Llama2 example (#121)

CI

    • Code Coverage: improve coverage (#119)

Bug Fix

  • Fix install bugs (#116)
  • Fix a batched matmul bug (#117)
  • Fix a padded matmul bug (#129)
  • Fix incorrect kernels
  • Fix a Python 3.11 installation issue (#135)
  • #112

ARK v0.2.0 New Operators

  • matmul: support float (#67)
  • reduce: support float (#67)
  • sub: add support (#73)
  • div: add support (#73)
  • sqrt: add support (#73)
  • exp: add support (#73)
  • sigmoid: add support (#73)
  • relu: support more types (#73)
  • gelu: support more types (#73)
  • [ ] int8 type support in many operators (#72) moved to the next version release
  • rmsnorm: add support (#95)
  • rope: add support (#95)
  • scale: support float (#95)

ARK v0.2.x Known Bugs & Issues

    • Support both source and destination offsets in NetIbQp::stage_send()
    • Offsets of importing/exporting tensors are not properly handled
    • Segfault when a model uses many SIDs (#115)
    • Use Kahan sum for layernorm

why RTX3060 with Ampere architecture is not supported?

For ARK v0.5.0, docs/install.md lists the supported NVIDIA GPUs as "Volta (CUDA >= 11.1) / Ampere (CUDA >= 11.1) / Hopper (CUDA >= 12.0)", but my RTX 3060 (Ampere, compute capability 8.6) cannot work with ARK. Why? I also found the code below in ark/ops/ops_common.cc; does it mean that only GPUs with capability 6.0/7.0/8.0/9.0 are supported?

OpArchType op_arch_from_string(const std::string &arch) {
    if (arch == "cuda_60") {
        return OP_ARCH_CUDA_60;
    } else if (arch == "cuda_70") {
        return OP_ARCH_CUDA_70;
    } else if (arch == "cuda_80") {
        return OP_ARCH_CUDA_80;
    } else if (arch == "cuda_90") {
        return OP_ARCH_CUDA_90;
    } else if (arch == "rocm_90a") {
        return OP_ARCH_ROCM_90A;
    } else if (arch == "rocm_942") {
        return OP_ARCH_ROCM_942;
    }
    return OP_ARCH_UNKNOWN;
}
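For comparison, compute capability 8.6 (RTX 3060) shares its major version with capability 8.0 (Ampere), so a lookup keyed on the major version alone would accept it. The sketch below is a hypothetical illustration of such a mapping in Python, not ARK's actual dispatch logic:

```python
# Hypothetical arch lookup keyed on the compute-capability *major* version,
# so sm_86 (RTX 3060) resolves to the same arch string as sm_80 (A100).
# Illustrative only; this is not ARK's actual op_arch_from_string().

ARCH_BY_MAJOR = {6: "cuda_60", 7: "cuda_70", 8: "cuda_80", 9: "cuda_90"}

def op_arch_from_capability(major, minor):
    """Map a (major, minor) compute capability to an arch string."""
    return ARCH_BY_MAJOR.get(major, "unknown")

print(op_arch_from_capability(8, 6))  # cuda_80
print(op_arch_from_capability(5, 2))  # unknown
```

Whether ARK's generated kernels actually run correctly on a given minor revision (e.g., sm_86's different shared-memory limits vs. sm_80) is a separate question from this string mapping.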
