yeonwoosung / mlops

Miscellaneous codes and writings for MLOps

License: GNU General Public License v3.0

JavaScript 0.14% HTML 0.25% Shell 0.09% Jupyter Notebook 97.01% Python 2.46% Dockerfile 0.02% Makefile 0.01% Gherkin 0.01% TypeScript 0.01% Java 0.01% Batchfile 0.01% CSS 0.01% PLpgSQL 0.01%
ai ai-as-a-service aws llm llm-inference llm-ops ml-serving mlops multimodal bentoml triton-inference-server apache-iceberg data-intensive-applications docker kubernetes spark spark-nlp rag vector-database vectordb

mlops's Issues

How Meta trains large language models at scale

Source: Meta Engineering blog post

  • Meta requires massive computational power to train large language models (LLMs)
  • Traditional AI model training involved a large number of models, each requiring a relatively small number of GPUs.
  • With the advent of generative AI (GenAI), the shift is toward fewer jobs, but each job is very large.

Challenges of training large-scale models

  • Hardware reliability: Requires rigorous testing and quality control to minimize training disruption due to hardware failure.
  • Fast recovery in case of failure: Jobs must recover quickly when hardware fails, which requires low rescheduling overhead and fast training reinitialization.
  • Efficient preservation of training state: Need to be able to efficiently save and recover training state in the event of a failure.
  • Optimal connectivity between GPUs: Data transfer between GPUs is critical for large-scale model training. This requires high-speed network infrastructure and efficient data transfer protocols.
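
The fast-recovery and state-preservation points above can be illustrated with a minimal sketch. It is not Meta's implementation: it assumes a toy training loop whose state fits in a small JSON file, and the names save_checkpoint, load_checkpoint, train, and the fail_at parameter are hypothetical. A real job would save model and optimizer state with its framework's own checkpoint tooling.

```python
import json
import os
import tempfile
from typing import Optional


def save_checkpoint(state: dict, path: str) -> None:
    """Write to a temp file first, then atomically rename it into place,
    so a crash mid-write never leaves a truncated checkpoint behind."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as f:
            json.dump(state, f)
            f.flush()
            os.fsync(f.fileno())
        os.replace(tmp_path, path)
    except BaseException:
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise


def load_checkpoint(path: str) -> Optional[dict]:
    """Return the saved state, or None if no checkpoint exists yet."""
    if not os.path.exists(path):
        return None
    with open(path) as f:
        return json.load(f)


def train(total_steps: int, ckpt_path: str, ckpt_every: int = 10,
          fail_at: Optional[int] = None) -> dict:
    """Toy loop: resume from the last checkpoint if there is one.
    fail_at (hypothetical) simulates a hardware failure at that step."""
    state = load_checkpoint(ckpt_path) or {"step": 0, "loss_sum": 0.0}
    while state["step"] < total_steps:
        if fail_at is not None and state["step"] == fail_at:
            raise RuntimeError("simulated hardware failure")
        state["step"] += 1
        state["loss_sum"] += 1.0 / state["step"]  # stand-in for a real training step
        if state["step"] % ckpt_every == 0:
            save_checkpoint(state, ckpt_path)
    save_checkpoint(state, ckpt_path)
    return state
```

With ckpt_every=10, a failure at step 25 loses only the five steps since the checkpoint at step 20. Checkpointing more often shrinks the lost work but spends more time writing state, which is the trade-off behind "efficient preservation".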

Improving all layers of the infrastructure stack is critical

Training software

  • Enables researchers to move quickly from research to production using open-source software such as PyTorch.
  • Involves developing new algorithms and techniques for large-scale training and integrating new software tools and frameworks.

Scheduling

  • Allocates resources based on the needs of each job and reschedules dynamically as workloads change, so that the available resources are used as fully as possible.
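
As a rough illustration of allocating GPUs by job need, here is a minimal sketch of a greedy, all-or-nothing ("gang") scheduler. It is a deliberately simple stand-in, not the algorithm Meta uses; the Job class, the schedule function, and the job names and sizes in the usage note are all hypothetical.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Job:
    name: str
    gpus: int          # GPUs the job needs, all at the same time
    priority: int = 0  # higher runs first


def schedule(jobs: List[Job], free_gpus: int) -> Tuple[List[str], List[str], int]:
    """Place jobs by priority, then by size; a job starts only if its
    full GPU request fits, otherwise it waits for the next round."""
    placed, waiting = [], []
    for job in sorted(jobs, key=lambda j: (-j.priority, -j.gpus)):
        if job.gpus <= free_gpus:
            free_gpus -= job.gpus
            placed.append(job.name)
        else:
            waiting.append(job.name)
    return placed, waiting, free_gpus
```

For example, with 16384 free GPUs and jobs requesting 16000 (priority 2), 8192, 128, and 64 (priority 1), the largest job starts, the 8192-GPU job waits, and the two small jobs backfill the remaining capacity.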

Hardware

  • Requires high-performance hardware to handle large-scale model training.
  • Optimized existing hardware: modified the Grand Teton platform (built with NVIDIA H100 GPUs), raised the GPU TDP to 700 W, and moved to HBM3 memory.

Data Center Placement

  • Placed GPUs and systems in the data center so as to make the best use of power, cooling, and networking.
  • Deployed as many GPU racks as possible to maximize compute density.
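
The density constraint can be made concrete with some back-of-the-envelope arithmetic. The sketch below uses the 700 W GPU TDP from the Hardware notes; the 8 GPUs per server, the rack power budget, and the non-GPU overhead per server are assumptions chosen only for illustration, and gpu_servers_per_rack is a hypothetical helper.

```python
def gpu_servers_per_rack(rack_budget_w: int,
                         gpu_tdp_w: int = 700,
                         gpus_per_server: int = 8,
                         other_w_per_server: int = 0) -> int:
    """How many GPU servers fit within a rack's power budget.

    gpus_per_server and other_w_per_server (CPUs, NICs, fans) are
    assumed values, not figures from the post.
    """
    per_server_w = gpu_tdp_w * gpus_per_server + other_w_per_server
    return rack_budget_w // per_server_w


# Illustrative only: a hypothetical 30 kW rack with 2 kW of
# non-GPU draw per server.
servers = gpu_servers_per_rack(30_000, other_w_per_server=2_000)
print(servers, "servers,", servers * 8, "GPUs per rack")
```

Under these assumed numbers the GPUs alone draw 5600 W per server, so power rather than floor space limits how many servers a rack can hold.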

Reliability

  • Detection and recovery plans in place to minimize downtime in the event of hardware failure.
  • Common failure modes: GPUs not detected by the host, uncorrectable errors (UCE) in DRAM and SRAM, and faulty hardware network cables.
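
A detection step for these three failure modes might look like the following minimal sketch. The health-check inputs (GPU counts, a UCE counter, link counts) and the drain-and-repair policy are assumptions made for illustration; the post does not describe Meta's tooling at this level, and the names Failure, triage, and next_action are hypothetical.

```python
from enum import Enum
from typing import List


class Failure(Enum):
    GPU_NOT_DETECTED = "gpu_not_detected"
    MEMORY_UCE = "memory_uce"        # uncorrectable DRAM/SRAM error
    NETWORK_CABLE = "network_cable"


def triage(expected_gpus: int, detected_gpus: int, uce_count: int,
           links_expected: int, links_up: int) -> List[Failure]:
    """Map raw health-check readings for one host to failure modes."""
    failures = []
    if detected_gpus < expected_gpus:
        failures.append(Failure.GPU_NOT_DETECTED)
    if uce_count > 0:
        failures.append(Failure.MEMORY_UCE)
    if links_up < links_expected:
        failures.append(Failure.NETWORK_CABLE)
    return failures


def next_action(failures: List[Failure]) -> str:
    """Assumed policy: any failure drains the host so that the job can
    be rescheduled onto healthy hardware and resumed from a checkpoint."""
    return "drain_and_repair" if failures else "keep_in_pool"
```

Running such a check before a job starts, as well as periodically during training, keeps a known-bad host from being handed straight back to the scheduler.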

Network

  • High-speed network infrastructure and efficient data transfer protocols are required for large-scale model training.
  • Built two clusters, one on RoCE and one on InfiniBand, to learn from the operational experience of running both.
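
To see why link bandwidth matters so much, consider the standard cost model for a ring all-reduce, a common collective for synchronizing gradients: with N GPUs and a payload of S bytes, each GPU sends about 2(N-1)/N × S bytes. The sketch below applies that formula; it ignores latency and topology, the function name ring_allreduce_seconds is hypothetical, and the post does not say which collective algorithms Meta runs.

```python
def ring_allreduce_seconds(size_bytes: float, n_gpus: int,
                           link_bytes_per_s: float) -> float:
    """Bandwidth-only estimate of one ring all-reduce.

    Each GPU sends 2 * (N - 1) / N * S bytes over its link: one
    reduce-scatter pass followed by one all-gather pass.
    """
    if n_gpus < 2:
        return 0.0  # nothing to synchronize
    traffic_per_gpu = 2 * (n_gpus - 1) / n_gpus * size_bytes
    return traffic_per_gpu / link_bytes_per_s
```

Because the per-GPU traffic approaches 2S as N grows, the estimate barely changes with cluster size but halves whenever link bandwidth doubles. That is the sense in which GPU-to-GPU connectivity becomes a first-order concern at this scale.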

Storage

  • Invested in high-capacity, high-speed storage technologies for large-scale data storage and developed new data storage solutions for specific tasks.

Looking ahead

  • We will work with hundreds of thousands of GPUs, handle even larger volumes of data, and deal with longer distances and latencies.
  • We plan to adopt new hardware technologies and GPU architectures and evolve our infrastructure.
  • We will explore the evolving landscape of AI and strive to push the boundaries of what is possible.
