
ark's Introduction

ARK

A GPU-driven system framework for scalable AI applications.


NOTE (Nov 2023): ROCm unit tests will be migrated to an Azure pipeline in the future.

See the Quick Start guide to get started.

Overview

ARK is a deep learning framework designed for highly optimized performance across distributed GPUs. Specifically, ARK adopts a GPU-driven execution model, in which the GPU autonomously schedules and executes both computation and communication without any CPU intervention.

ARK provides a set of APIs for users to express their distributed deep learning applications. ARK then automatically schedules a GPU-driven execution plan for the application and generates GPU kernel code called a loop kernel. The loop kernel contains a loop that iteratively executes the entire application, including both computation and communication. ARK then executes this loop kernel across the distributed GPUs.
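The execution model above can be sketched in plain Python (no real ARK APIs are used here; all names are illustrative): a loop kernel is a single persistent loop that walks the fused list of operators every iteration, rather than launching one kernel per operator from the CPU.

```python
# Conceptual sketch of a "loop kernel": one persistent loop that runs the
# application's operator list (computation and communication) each iteration,
# with no CPU intervention between steps. All names are illustrative.

def make_loop_kernel(ops):
    """Fuse a list of operator callables into a single persistent loop."""
    def loop_kernel(state, iterations):
        for _ in range(iterations):
            for op in ops:          # computation and communication alike
                state = op(state)
        return state
    return loop_kernel

# Toy "operators": a compute step and a (simulated) communication step.
compute = lambda x: x * 2
communicate = lambda x: x + 1      # stands in for a send/recv pair

kernel = make_loop_kernel([compute, communicate])
print(kernel(1, 3))  # 15
```

On a real GPU the loop body is generated device code and the loop is a persistent kernel; this sketch only mirrors the control flow.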

GPU-driven System Architecture

Status & Roadmap

ARK is under active development, and some of its features will be added in future releases. The following summarizes the key features of each version.

New in ARK v0.5 (Latest Release)

  • Integrated with MSCCL++
  • Removed dependency on gpudma
  • Added AMD CDNA3 architecture support
  • Added communication support for AMD GPUs
  • Optimized OpGraph scheduling
  • Added a multi-GPU Llama2 example

See #168 for details.

ARK v0.6 (TBU, Jan. 2024)

  • Overall performance optimization
  • Improve Python unit tests & code coverage

Contributing

This project welcomes contributions and suggestions. Most contributions require you to agree to a Contributor License Agreement (CLA) declaring that you have the right to, and actually do, grant us the rights to use your contribution. For details, visit https://cla.opensource.microsoft.com.

When you submit a pull request, a CLA bot will automatically determine whether you need to provide a CLA and decorate the PR appropriately (e.g., status check, comment). Simply follow the instructions provided by the bot. You will only need to do this once across all repos using our CLA.

This project has adopted the Microsoft Open Source Code of Conduct. For more information see the Code of Conduct FAQ or contact [email protected] with any additional questions or comments.

Trademarks

This project may contain trademarks or logos for projects, products, or services. Authorized use of Microsoft trademarks or logos is subject to and must follow Microsoft's Trademark & Brand Guidelines. Use of Microsoft trademarks or logos in modified versions of this project must not cause confusion or imply Microsoft sponsorship. Any use of third-party trademarks or logos are subject to those third-party's policies.

Citations


ARK is a collaborative research initiative between KAIST and Microsoft Research. If you use this project in your research, please cite our NSDI'23 paper:

@inproceedings{HwangPSQCX23,
  author    = {Changho Hwang and
               KyoungSoo Park and
               Ran Shu and
               Xinyuan Qu and
               Peng Cheng and
               Yongqiang Xiong},
  title     = {ARK: GPU-driven Code Execution for Distributed Deep Learning},
  booktitle = {20th {USENIX} Symposium on Networked Systems Design and Implementation ({NSDI} 23)},
  year      = {2023},
  publisher = {{USENIX} Association},
}

ark's People

Contributors

binyang2014, chhwang, microsoftopensource, wusar


ark's Issues

[Bug] CK GeMM correctness bug

Describe the bug
test_matmul_fp32 and test_matmul_fp16_split fail on MI300x.

To Reproduce
Run the unit test.

Expected behavior
max_diff should be lower than the calculated value.
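For reference, a max_diff metric in this kind of test is typically the largest elementwise deviation of the low-precision result from a higher-precision reference, compared against a calculated error bound. A self-contained sketch (plain Python, not ARK's actual test harness; the rounding step merely simulates reduced precision):

```python
# Sketch of a max_diff tolerance check for a reduced-precision matmul.
# "Low precision" is simulated by rounding; this is not ARK's real harness.

def matmul(a, b):
    n, k, m = len(a), len(b), len(b[0])
    return [[sum(a[i][p] * b[p][j] for p in range(k)) for j in range(m)]
            for i in range(n)]

def max_diff(x, y):
    """Largest elementwise absolute difference between two matrices."""
    return max(abs(xi - yi) for rx, ry in zip(x, y) for xi, yi in zip(rx, ry))

a = [[0.1, 0.2], [0.3, 0.4]]
b = [[0.5, 0.6], [0.7, 0.8]]
ref = matmul(a, b)                                  # high-precision reference
low = [[round(v, 2) for v in row] for row in ref]   # simulated low precision
tolerance = 0.01                                    # calculated error bound
assert max_diff(ref, low) <= tolerance              # the test's pass condition
```

A failure like the one reported means the observed max_diff exceeded the calculated bound on MI300x.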

System (please complete the following information):

  • MI300x
  • Single-GPU


transformer_test.py prompts [AttributeError: 'Model' object has no attribute 'tensor']

ENV: 2 A100s, running the provided images for ARK versions 0.3.0 through 0.5.0.
BUG: Running python3 transformer_test.py in examples/transformer raises AttributeError: 'Model' object has no attribute 'tensor'.
e.g. ARK 0.5.0:
Traceback (most recent call last):
  File "examples/transformer/transformer_test.py", line 560, in <module>
    test_PoswiseFeedForwardNet()
  File "examples/transformer/transformer_test.py", line 15, in test_PoswiseFeedForwardNet
    input_tensor = model.tensor(
AttributeError: 'Model' object has no attribute 'tensor'
Please tell us how to solve this problem, thank you.

ARK v0.4.0 Release Plan (Released)

Timeline

Release Date: Nov. 14th, 2023

Work Items (TBU)

Platforms Support

    • ROCm: add ROCm backend support (#162)

Operators Support

    • [ ] Operator: add int8 operators (dropping this plan for now)
    • Operator: add more AllReduce & AllGather algorithms (#152)

Examples

    • [ ] Example: add Llama2 multi-GPU examples (moved to the next version release)

CI

    • [ ] Unit Tests: revise Python unit tests & add to the Azure pipeline (moved to the next version release)
    • [ ] Code Coverage: add code coverage for Python code (moved to the next version release)

Bug Fix

Action required: migrate or opt-out of migration to GitHub inside Microsoft

Migrate non-Open Source or non-External Collaboration repositories to GitHub inside Microsoft

In order to protect and secure Microsoft, private or internal repositories in GitHub for Open Source which are not related to open source projects or require collaboration with 3rd parties (customer, partners, etc.) must be migrated to GitHub inside Microsoft a.k.a GitHub Enterprise Cloud with Enterprise Managed User (GHEC EMU).

Action

✍️ Please RSVP to opt-in or opt-out of the migration to GitHub inside Microsoft.

❗Only users with admin permission in the repository are allowed to respond. Failure to provide a response will result in your repository being automatically archived.🔒

Instructions

Reply with a comment on this issue containing one of the optin or optout commands below.

✅ Opt-in to migrate

@gimsvc optin --date <target_migration_date in mm-dd-yyyy format>

Example: @gimsvc optin --date 03-15-2023

OR

❌ Opt-out of migration

@gimsvc optout --reason <staging|collaboration|delete|other>

Example: @gimsvc optout --reason staging

Options:

  • staging : This repository will ship as Open Source or go public
  • collaboration : Used for external or 3rd party collaboration with customers, partners, suppliers, etc.
  • delete : This repository will be deleted because it is no longer needed.
  • other : Other reasons not specified

Need more help? 🖐️

[Bug] ARK 0.4.1: multi_gpu_tutorial.py run error

Describe the bug
ARK 0.4.1: running multi_gpu_tutorial.py fails in sched_default.cc (line 393, in configure_gpu_buf) and tensor.cc (line 246, in update_pads) with the following error:
invalid padding detected. This is likely caused because one GPU buffer is used by multiple operators that require different padding. A possible workaround is to let each operator use a different buffer by creating a new tensor rather than overwriting an existing tensor. op name: send

To Reproduce
Run multi_gpu_tutorial.py in ARK 0.4.1.

Expected behavior

  1. An explanation of why this error occurs.
  2. What relationship must ldims, type_bytes, and tile of ref_tensor and this_tensor satisfy in update_pads?

System (please complete the following information):

  • ark0.4.1
  • OS: [e.g. Ubuntu18.04]
  • GPU [A100]
  • Networking Environment [Single-node, Multi-gpu]


ARK v0.3.0 New Operators

  • [ ] int8 type support in many operators (#72) moved to the next version plan
  • bfloat16 type support in many operators (#142)
  • embedding: add support (#122)
  • cast: add support (#127)

ARK v0.2.0 Release Plan (Released)

Timeline

Release Date: Sep. 5th, 2023

Work Items

Model

    • Interface: expose the underlying buffer info to Tensor (#79)

Scheduler

    • [ ] Graph Optimization: enable this feature moved to the next version
    • [ ] SimpleScheduler: fix broken features moved to the next version

Communication Stack

    • Interface: hide GpuCommSw implementation from the interface (#81)
    • Interface: extend the current interface (#104)

Operators Support

    • Operator: add more operators (#62)
    • Operator: upgrade CUTLASS (#105)

Python

    • Interface: #96

Examples

    • [ ] Example: add Llama2 example (#102) moved to the next version
    • Example: parallel matmul example (#64)

Bug Fix

Documents

CI

    • Code Coverage: add code coverage (#110)
    • Unit Tests: add a unit test pipeline (#88)
    • Unit Tests: #91

build error

Describe the bug
When I build the source code following the install document on Ubuntu 20.04, the following errors occur:

In file included from /root/ark/third_party/mscclpp/src/include/atomic.hpp:9,
                 from /root/ark/third_party/mscclpp/src/fifo.cc:8:
/root/ark/third_party/mscclpp/include/mscclpp/atomic_device.hpp: In function ‘T mscclpp::atomicLoad(T*, cuda::__3::memory_order)’:
/root/ark/third_party/mscclpp/include/mscclpp/atomic_device.hpp:25:16: error: ‘atomic_ref’ is not a member of ‘cuda’
   25 |   return cuda::atomic_ref<T, cuda::thread_scope_system>{*ptr}.load(memoryOrder);
      |                ^~~~~~~~~~
/root/ark/third_party/mscclpp/include/mscclpp/atomic_device.hpp:25:28: error: expected primary-expression before ‘,’ token
   25 |   return cuda::atomic_ref<T, cuda::thread_scope_system>{*ptr}.load(memoryOrder);
      |                            ^
/root/ark/third_party/mscclpp/include/mscclpp/atomic_device.hpp:25:56: error: expected primary-expression before ‘{’ token
   25 |   return cuda::atomic_ref<T, cuda::thread_scope_system>{*ptr}.load(memoryOrder);
      |                                                        ^
/root/ark/third_party/mscclpp/include/mscclpp/atomic_device.hpp:25:56: error: expected ‘;’ before ‘{’ token
   25 |   return cuda::atomic_ref<T, cuda::thread_scope_system>{*ptr}.load(memoryOrder);
      |                                                        ^
      |                                                        ;
/root/ark/third_party/mscclpp/include/mscclpp/atomic_device.hpp:25:61: error: expected ‘;’ before ‘}’ token
   25 |   return cuda::atomic_ref<T, cuda::thread_scope_system>{*ptr}.load(memoryOrder);
      |                                                             ^
      |                                                             ;
/root/ark/third_party/mscclpp/include/mscclpp/atomic_device.hpp:25:62: error: expected primary-expression before ‘.’ token
   25 |   return cuda::atomic_ref<T, cuda::thread_scope_system>{*ptr}.load(memoryOrder);
      |                                                              ^
/root/ark/third_party/mscclpp/include/mscclpp/atomic_device.hpp: In function ‘void mscclpp::atomicStore(T*, const T&, cuda::__3::memory_order)’:
/root/ark/third_party/mscclpp/include/mscclpp/atomic_device.hpp:30:9: error: ‘atomic_ref’ is not a member of ‘cuda’
   30 |   cuda::atomic_ref<T, cuda::thread_scope_system>{*ptr}.store(val, memoryOrder);
      |         ^~~~~~~~~~
/root/ark/third_party/mscclpp/include/mscclpp/atomic_device.hpp:30:21: error: expected primary-expression before ‘,’ token
   30 |   cuda::atomic_ref<T, cuda::thread_scope_system>{*ptr}.store(val, memoryOrder);
      |                     ^
/root/ark/third_party/mscclpp/include/mscclpp/atomic_device.hpp:30:49: error: expected primary-expression before ‘{’ token
   30 |   cuda::atomic_ref<T, cuda::thread_scope_system>{*ptr}.store(val, memoryOrder);
      |                                                 ^
/root/ark/third_party/mscclpp/include/mscclpp/atomic_device.hpp:30:55: error: expected primary-expression before ‘.’ token
   30 |   cuda::atomic_ref<T, cuda::thread_scope_system>{*ptr}.store(val, memoryOrder);
      |                                                       ^
/root/ark/third_party/mscclpp/include/mscclpp/atomic_device.hpp: In function ‘T mscclpp::atomicFetchAdd(T*, const T&, cuda::__3::memory_order)’:
/root/ark/third_party/mscclpp/include/mscclpp/atomic_device.hpp:35:16: error: ‘atomic_ref’ is not a member of ‘cuda’
   35 |   return cuda::atomic_ref<T, cuda::thread_scope_system>{*ptr}.fetch_add(val, memoryOrder);
      |                ^~~~~~~~~~
/root/ark/third_party/mscclpp/include/mscclpp/atomic_device.hpp:35:28: error: expected primary-expression before ‘,’ token
   35 |   return cuda::atomic_ref<T, cuda::thread_scope_system>{*ptr}.fetch_add(val, memoryOrder);
      |                            ^
/root/ark/third_party/mscclpp/include/mscclpp/atomic_device.hpp:35:56: error: expected primary-expression before ‘{’ token
   35 |   return cuda::atomic_ref<T, cuda::thread_scope_system>{*ptr}.fetch_add(val, memoryOrder);
      |                                                        ^
/root/ark/third_party/mscclpp/include/mscclpp/atomic_device.hpp:35:56: error: expected ‘;’ before ‘{’ token
   35 |   return cuda::atomic_ref<T, cuda::thread_scope_system>{*ptr}.fetch_add(val, memoryOrder);
      |                                                        ^
      |                                                        ;
/root/ark/third_party/mscclpp/include/mscclpp/atomic_device.hpp:35:61: error: expected ‘;’ before ‘}’ token
   35 |   return cuda::atomic_ref<T, cuda::thread_scope_system>{*ptr}.fetch_add(val, memoryOrder);
      |         

System:

  • OS: Ubuntu20.04
  • GPU Geforce 3060 (single)
  • Compiler: g++(9.4), cmake (3.28.1)
  • NVCC: 11.4

[Feature] Can you add a latency comparison between using ARK and not using it to the existing examples, e.g., the Llama demo?


ARK v0.5.0 Release Plan (Released)

Timeline

Release Date: Dec. 16th, 2023

Work Items

Major Improvement

    • MSCCL++: integrated with MSCCL++ and removed dependency on gpudma (#179)

Platforms Support

    • ROCm: add ROCm multi-GPU support (#181)

Operators

    • reduce: keepdims support for reduction (#173)

Optimization

    • OpGraph: optimize OpGraph scheduling (#182)

Examples

    • Example: add Llama2 multi-GPU examples (#170)

CI

    • [ ] Unit Tests: revise Python unit tests & add to the Azure pipeline (moved to the next version release plan)
    • [ ] Unit Tests: add ROCm Azure pipelines (moved to the next version release plan)
    • [ ] Code Coverage: add code coverage for Python code (moved to the next version release plan)

ARK v0.3.x Known Bugs & Issues

    • Support both source and destination offsets in NetIbQp::stage_send()
    • Offsets of importing/exporting tensors are not properly handled
    • Use Kahan sum for layernorm (#159)
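For context on the last item: Kahan (compensated) summation carries a running error term so that long floating-point reductions, like the mean/variance sums inside layernorm, lose far less precision than naive accumulation. A standard sketch of the algorithm:

```python
def kahan_sum(values):
    """Compensated summation: carries the rounding error of each addition."""
    total = 0.0
    c = 0.0                   # running compensation for lost low-order bits
    for v in values:
        y = v - c             # subtract the previously lost bits
        t = total + y         # big + small: low-order bits of y may be lost
        c = (t - total) - y   # algebraically 0; recovers what was lost
        total = t
    return total

# Summing a million tiny values: the compensated result stays near exact.
vals = [1e-8] * 10**6
print(abs(kahan_sum(vals) - 1e-2))  # close to zero
```

The compensation step `(t - total) - y` is exactly the part an optimizing compiler must not simplify away, which is why such code is usually built without fast-math flags.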

ARK v0.6.0 Release Plan

Timeline

Expected Release Date: Mar. 31st, 2024 (previously Jan. 16th, 2024)

Work Items (TBU)

Performance

    • Computation Kernel: improved vectorization (#189)
    • Communication Kernel: improve performance (#190)

CI

    • Unit Tests: revise Python unit tests & add to the Azure pipeline
    • Unit Tests: add ROCm Azure pipelines
    • Unit Tests: add ROCm 6.0 CI (#188)
    • Code Coverage: add code coverage for Python code

Bug Fix

    • CUTLASS: fix the patch file (#187)
    • OpGraph: a bug fix (#192)

ARK v0.1.0 Known Bugs & Issues

    • Tensor::is_sequential() may need more strict conditions. (#111)
    • Executor::tensor_memcpy_host_to_device() causes an unknown error if the host-side tensor is not sequential. We need more checks on the host-side tensor, or maybe a Python wrapper for this. (#48)
    • Sometimes, if the tensor is padded, the allgather operation might overwrite the recv tensor, making the allreduce result incorrect as well. (@chhwang: now send/recv checks contiguity)
    • The current layernorm and softmax operations are scheduled in a rather hacky way and may need further updates. (#59)
    • ark.init() is not working. (#39)
    • Layernorm needs a recv dependency at its output (@chhwang: it already has one)
    • ARK environments are not working for Python (#54)
    • [ ] Support both source and destination offsets in NetIbQp::stage_send() moved to the next version
    • Remove a misleading error message (#111):
      if (input->ndims() > 1) {
          LOG(INFO,
              "warning: if the send tensor if not contiguous, the all_gather "
              "may not work correctly");
      }
    • ops_matmul_test.cc is not checking error rates correctly (#91)
    • send_mm and recv_mm are temporarily broken (#52)
    • When using python -m unittest discover -s . -p "test_*.py" to run all unit tests, the sendrecv test fails, but when we run the tests separately there is no problem. It seems that in some cases the previous runtime context is not destroyed when one unit test finishes and another starts. This problem also exists in the current main branch. (@chhwang: this is the test code's issue, won't fix for now)
    • The matmul test fails for shapes larger than (128, 2048, 1024) (#54)
    • [ ] Offsets of importing/exporting tensors are not properly handled moved to the next version
    • The matmul unit test fails for test_matmul_transpose (#94)
    • The float matmul error rate seems too high, but it is unclear whether this is ARK's issue or the test code's issue (@chhwang: this is not an issue)

ARK v0.3.0 Release Plan (Released)

Timeline

Release Date: Oct. 4th, 2023

Work Items

Interface

    • Python APIs: revise interface
    • Communication: revise send/recv interfaces (#138)

Scheduler

    • Feature: enable heuristic graph optimization (#136)
    • Feature: support mixed precision (#134)
    • [ ] SimpleScheduler: fix broken features plan changed -- deprecate SimpleScheduler

Communication Stack

    • Interface: make send/recv interface simpler (#138)

Operators Support

    • Operator: add more operators (#107)

Examples

    • Example: add Llama2 example (#121)

CI

    • Code Coverage: improve coverage (#119)

Bug Fix

  • Fix install bugs (#116)
  • Fix a batched matmul bug (#117)
  • Fix a padded matmul bug (#129)
  • Fix incorrect kernels
  • Fix a Python 3.11 installation issue (#135)
  • #112

ARK v0.2.0 New Operators

  • matmul: support float (#67)
  • reduce: support float (#67)
  • sub: add support (#73)
  • div: add support (#73)
  • sqrt: add support (#73)
  • exp: add support (#73)
  • sigmoid: add support (#73)
  • relu: support more types (#73)
  • gelu: support more types (#73)
  • [ ] int8 type support in many operators (#72) moved to the next version release
  • rmsnorm: add support (#95)
  • rope: add support (#95)
  • scale: support float (#95)

ARK v0.2.x Known Bugs & Issues

    • Support both source and destination offsets in NetIbQp::stage_send()
    • Offsets of importing/exporting tensors are not properly handled
    • Segfault when a model uses many SIDs (#115)
    • Use Kahan sum for layernorm

why RTX3060 with Ampere architecture is not supported?

For ARK v0.5.0, docs/install.md lists the supported NVIDIA GPUs as "Volta (CUDA >= 11.1) / Ampere (CUDA >= 11.1) / Hopper (CUDA >= 12.0)", but my RTX 3060 (Ampere, compute capability 8.6) cannot work with ARK. Why? I also found the code below in ark/ops/ops_common.cc; does it mean that only GPUs with capability 6.0/7.0/8.0/9.0 are supported?

OpArchType op_arch_from_string(const std::string &arch) {
    if (arch == "cuda_60") {
        return OP_ARCH_CUDA_60;
    } else if (arch == "cuda_70") {
        return OP_ARCH_CUDA_70;
    } else if (arch == "cuda_80") {
        return OP_ARCH_CUDA_80;
    } else if (arch == "cuda_90") {
        return OP_ARCH_CUDA_90;
    } else if (arch == "rocm_90a") {
        return OP_ARCH_ROCM_90A;
    } else if (arch == "rocm_942") {
        return OP_ARCH_ROCM_942;
    }
    return OP_ARCH_UNKNOWN;
}
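For comparison, compute capability 8.6 (RTX 3060) shares its major version with capability 8.0 (Ampere), so a lookup keyed on the major version alone would accept it. The sketch below is a hypothetical illustration of such a mapping in Python, not ARK's actual dispatch logic:

```python
# Hypothetical arch lookup keyed on the compute-capability *major* version,
# so sm_86 (RTX 3060) resolves to the same arch string as sm_80 (A100).
# Illustrative only; this is not ARK's actual op_arch_from_string().

ARCH_BY_MAJOR = {6: "cuda_60", 7: "cuda_70", 8: "cuda_80", 9: "cuda_90"}

def op_arch_from_capability(major, minor):
    """Map a (major, minor) compute capability to an arch string."""
    return ARCH_BY_MAJOR.get(major, "unknown")

print(op_arch_from_capability(8, 6))  # cuda_80
print(op_arch_from_capability(5, 2))  # unknown
```

Whether ARK's generated kernels actually run correctly on a given minor revision (e.g., sm_86's different shared-memory limits vs. sm_80) is a separate question from this string mapping.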
