openhackathons-org / end-to-end-llm

This repository contains AI Bootcamp material consisting of an end-to-end workflow for LLMs.

License: Apache License 2.0

Jupyter Notebook 55.73% Shell 6.22% Python 37.79% HTML 0.26% JavaScript 0.01%
deep-learning natural-language-processing p-tuning prompt-tuning nemo-megatron llm nemo-guardrails question-answering tensorrt-llm genai

end-to-end-llm's Introduction

End-to-End LLM Bootcamp

The End-to-End LLM (Large Language Model) Bootcamp is designed from a real-world perspective and follows the data processing, development, and deployment pipeline paradigm. Attendees walk through the workflow of preprocessing the SQuAD (Stanford Question Answering Dataset) dataset for the Question Answering task, training BERT (Bidirectional Encoder Representations from Transformers) on the dataset, and executing a prompt learning strategy using NVIDIA® NeMo™ and a transformer-based language model, NVIDIA Megatron. Attendees will also learn to optimize an LLM using NVIDIA TensorRT™, an SDK for high-performance deep learning inference; to guardrail prompts to and responses from the LLM using NeMo Guardrails; and to deploy the AI pipeline using NVIDIA Triton™ Inference Server, open-source software that standardizes AI model deployment and execution across every workload.

Bootcamp Content

This content contains three labs, plus an introductory notebook and two lab activity notebooks:

  • Overview of End-To-End LLM bootcamp
  • Lab 1: Megatron-GPT
  • Lab 2: TensorRT-LLM and Triton Deployment with the Llama-2-7B Model
  • Lab 3: NeMo Guardrails
  • Lab Activity 1: Question Answering task
  • Lab Activity 2: P-tuning/Prompt tuning task

Tools and Frameworks

The tools and frameworks used in the Bootcamp material are NVIDIA NeMo, NVIDIA Megatron, NVIDIA TensorRT-LLM, NeMo Guardrails, and NVIDIA Triton Inference Server.

Tutorial duration

The total Bootcamp material takes approximately 8 hours and 45 minutes to teach. We recommend dividing the material into two days, covering Lab 1 in one session and the rest in the next session.

Deploying the Bootcamp Material

To deploy the Labs, please refer to the Deployment guide presented here

Attribution

This material originates from the OpenHackathons Github repository. Check out additional materials here

Don't forget to check out additional Open Hackathons Resources and join our OpenACC and Hackathons Slack Channel to share your experience and get more help from the community.

Licensing

Copyright © 2023 OpenACC-Standard.org. This material is released by OpenACC-Standard.org, in collaboration with NVIDIA Corporation, under the Creative Commons Attribution 4.0 International (CC BY 4.0). These materials may include references to hardware and software developed by other entities; all applicable licensing and copyrights apply.

end-to-end-llm's People

Contributors

muntasers, programmah

end-to-end-llm's Issues

Feature Request - Addition of Benchmarking for TRT-LLM

The current TRT-LLM materials discuss the hands-on aspects of getting from a model to deployment on a Triton server.

Given that TRT-LLM focuses on performance, we could add a section that discusses the performance aspects of TRT-LLM and the various optimisations available to the end user.

Issue: Deployment guide is outdated or incorrect

The deployment guide states the following:

When you are inside the container, launch JupyterLab: jupyter-lab --no-browser --allow-root --ip=0.0.0.0 --port=8888 --NotebookApp.token="" --notebook-dir=/workspace. Open the browser at http://localhost:8888 and click on the Start_here.ipynb notebook.

But when building the container there is no actual Start_here.ipynb (unless you go to archived/workspace, which suggests that it is either deprecated or that it is not well defined where the notebook should be found).

Issue: NeMo container library issues and Start_Here.ipynb links conflict issues for different containers

NeMo container issues:

  • Unable to download the dataset due to a gdown library issue; the gdown library requires an upgrade within the NeMo container.
  • Unable to connect to the server with the NeMo-LLM service; to solve the issue, the NeMo Guardrails library requires an upgrade within the container.
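A minimal sketch of the upgrade workaround, assuming pip is available inside the NeMo container; the package names come from the issue text above, and the command is built but not executed so it can be inspected first:

```python
# Sketch of the workaround: upgrade gdown and nemoguardrails inside the
# running NeMo container (run from a notebook cell or container shell).
import subprocess
import sys

def pip_upgrade(packages):
    """Build a `pip install --upgrade` command for the given packages."""
    return [sys.executable, "-m", "pip", "install", "--upgrade", *packages]

cmd = pip_upgrade(["gdown", "nemoguardrails"])
# subprocess.run(cmd, check=True)  # uncomment to actually perform the upgrade
```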

Start_Here.ipynb links conflict issues for different containers:

  • Users sometimes click on labs that run on different containers and get errors. To avoid this issue, separate Start_Here.ipynb notebooks should be created for each lab.

Issue: Nemo_primer.ipynb imports not working.

In Nemo_primer.ipynb, running import nemo.collections.asr as nemo_asr, import nemo.collections.nlp as nemo_nlp, and import nemo.collections.tts as nemo_tts raises the following error:

```
ImportError: tokenizers>=0.11.1,!=0.11.3,<0.14 is required for a normal functioning of this module, but found tokenizers==0.15.2.
```

If I try to solve it with pip install tokenizers==0.13.1, I get this other error:

```
File /usr/local/lib/python3.10/dist-packages/pytorch_lightning/_graveyard/utilities.py:25
     17 def _get_gpu_memory_map() -> None:
     18     # TODO: Remove in v2.0.0
     19     raise RuntimeError(
     20         "pytorch_lightning.utilities.memory.get_gpu_memory_map was deprecated in v1.5 and is no longer supported"
     21         " as of v1.9. Use pytorch_lightning.accelerators.cuda.get_nvidia_gpu_stats instead."
     22     )
---> 25 pl.utilities.memory.get_gpu_memory_map = _get_gpu_memory_map

AttributeError: partially initialized module 'pytorch_lightning' has no attribute 'utilities' (most likely due to a circular import)
```

It might be helpful to pin the desired package versions in the pip installs inside Dockerfile_nemo, because

```
RUN pip install lightning
RUN pip install megatron.core
RUN pip install --upgrade nemoguardrails
RUN pip install openai
RUN pip install ujson
RUN pip install --upgrade --no-cache-dir gdown
```

installs new and incompatible versions of the libraries (incompatible with the tutorials shown in the notebooks).
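As a quick sanity check before running the notebooks, the version range quoted in the ImportError above (tokenizers>=0.11.1,!=0.11.3,<0.14) can be tested against an installed version. This helper is purely illustrative, not part of the repository:

```python
# Hypothetical helper: check a tokenizers version string against the range
# NeMo's error message demands (tokenizers>=0.11.1,!=0.11.3,<0.14).
def satisfies_nemo_tokenizers(ver: str) -> bool:
    parts = tuple(int(p) for p in ver.split(".")[:3])
    if parts == (0, 11, 3):          # the explicitly excluded version
        return False
    return (0, 11, 1) <= parts < (0, 14, 0)

print(satisfies_nemo_tokenizers("0.13.1"))  # True  -> compatible pin
print(satisfies_nemo_tokenizers("0.15.2"))  # False -> the version the issue reports
```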

Issue: 98 - Address already in use, Unable to download MegatronGPT 1.3B, and Triton Server issue

  • Downloading the MegatronGPT 1.3B model from Google Drive fails intermittently, delaying the Lab Activity 2 notebook. Google Drive restricts permissions when it detects multiple download requests. The solution is to download the files ahead of time, before mounting the workspace into the container.
  • An "errno: 98 - Address already in use" error occurs when running the trainer.fit() cell within the Prompt/p-tuning notebook. The solution is to set the DDP master port to an unused value before calling trainer.fit() (e.g., via os.environ['MASTER_PORT']).
  • A Triton Server error, "mpirun detected that one or more processes exited with non-zero status, thus causing the job to be terminated", occurs due to several jobs running on the same node. The issue can be resolved by modifying a line in the launch_triton_server.py script, from:

```python
cmd += ' -n 1 {} --model-repository={} --disable-auto-complete-config --backend-config=python,shm-region-prefix-name=prefix{} : '.format(tritonserver, model_repo, i)
```

to

```python
cmd += ' -n 1 {} --model-repository={} --disable-auto-complete-config --backend-config=python,shm-region-prefix-name=prefix{} : '.format(tritonserver, model_repo, str(i) + os.environ['USER'])
```
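The DDP-port workaround described above can be sketched as follows. The free-port selection via a throwaway socket is an illustrative approach, not the repository's own code; any unused port value would do:

```python
# Sketch of the "Address already in use" workaround: pick a free port and
# export it as MASTER_PORT before calling trainer.fit().
import os
import socket

def set_free_master_port():
    """Bind to port 0 so the OS picks a free port, then export it for DDP."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.bind(("127.0.0.1", 0))
        port = s.getsockname()[1]
    os.environ["MASTER_PORT"] = str(port)
    return port

port = set_free_master_port()
# trainer.fit(model)  # would now use the freshly chosen DDP port
```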

Feature Request: Fine-tune Llama-2-7B with Custom Dataset

This feature request is required as part of an end-to-end pipeline. The process should include:

  • dataset preprocessing
  • use of a PEFT method to fine-tune Llama-2-7B for a text generation task
  • merging the fine-tuned and base models
  • inferencing

Issue: Many unnecessary files and folders within the NeMo Guardrails lab

Many unnecessary files and folders are included within the NeMo Guardrails lab, making navigation difficult. The lab should not contain the entire cloned repository, but only a folder with the needed files, folders, and notebooks. The Deployment_Guide.md file should explicitly state the services and requirements (OpenAI and NeMo LLM Service) needed to run the lab.

Feature Request: Validating prompt response from Triton server using NeMo Guardrails

This feature request is about creating content that demonstrates how to connect NeMo Guardrails to a Llama-2-7b-chat TensorRT engine deployed on Triton Inference Server. This approach avoids the need for an OpenAI key and bypasses the NeMo-LLM Service when using NeMo Guardrails to guard user prompts to/from the deployed model. The LangChain framework can be used to achieve the task.
The feature is required to complete the end-to-end LLM pipeline.
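A rough sketch of the first half of this idea: querying a TensorRT-LLM engine behind Triton over HTTP so a local wrapper, rather than OpenAI, backs the guardrails. The endpoint URL, model name ("ensemble"), and request fields ("text_input", "max_tokens", "text_output") are assumptions based on a typical TensorRT-LLM deployment and must be adjusted to the actual setup:

```python
# Hypothetical client for a Llama-2-7B TensorRT-LLM engine served by Triton.
# All endpoint and field names are assumptions, not the repository's code.
import json
import urllib.request

TRITON_URL = "http://localhost:8000/v2/models/ensemble/generate"  # assumed

def build_generate_payload(prompt: str, max_tokens: int = 64) -> dict:
    """Request body for the assumed generate-endpoint schema."""
    return {"text_input": prompt, "max_tokens": max_tokens}

def query_triton(prompt: str) -> str:
    """POST a prompt to Triton and return the generated text."""
    req = urllib.request.Request(
        TRITON_URL,
        data=json.dumps(build_generate_payload(prompt)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["text_output"]
```

A custom LangChain LLM wrapper around query_triton() could then be handed to NeMo Guardrails so prompts and responses are guarded entirely locally.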
