
modular-ml / wrapyfi-examples_llama


This project is forked from meta-llama/llama.

Stars: 129 · Forks: 13 · Size: 62 KB

Inference code for Facebook's LLaMA models, with Wrapyfi support

License: GNU General Public License v3.0

Shell 5.78% Python 92.36% Dockerfile 1.87%

wrapyfi-examples_llama's People

Contributors

fabawi, glample, guangyusong, timlacroix



wrapyfi-examples_llama's Issues

What's the total WPS when I use multiple GPUs for inference?

GPU ID | Type          | CPU Mem. | Power        | GPU Mem. | WPS
0      | TITAN Xp 12GB | 2.4 GB   | 79 W / 250 W | 5.6 GB   | 12
1      | TITAN Xp 12GB | 1.3 GB   | 63 W / 250 W | 5.6 GB   | 12
2      | TITAN X 12GB  | 1.3 GB   | 89 W / 250 W | 5.5 GB   | 12
3      | TITAN X 12GB  | 1.3 GB   | 99 W / 250 W | 6.2 GB   | 12

I've tried using multiple GPUs (four here) for inference with the 7B model, but I found that GPU utilization is much lower than on a single GPU. It seems Wrapyfi with LLaMA can't fully utilize the GPUs.
For the performance mentioned in the README, what total WPS should I use: 12 or 4×12?
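
For what it's worth, a back-of-envelope sketch (this assumes the fork splits the model's layers across devices so every token passes through each device in sequence; that is an assumption about the fork's design, not a confirmed reading of its code):

# If the devices form a pipeline over the SAME token stream, throughput is
# bounded by the slowest stage rather than summed across devices.
per_device_wps = [12, 12, 12, 12]   # measured WPS per GPU (table above)

pipeline_wps = min(per_device_wps)  # one shared stream: ~12 WPS total
replica_wps = sum(per_device_wps)   # four independent replicas would give 48,
                                    # but that is data parallelism, not this setup

print(pipeline_wps, replica_wps)    # 12 48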

Running on CPUs?

This isn't really an issue, but I'm trying to find a way to link multiple mobile/laptop devices together to essentially piggyback on each one's CPU. Is that doable with this fork? Any suggestions and tips would be welcome!
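
For anyone exploring this, a minimal sketch of the kind of change a CPU-only run would need (hypothetical; the function mirrors setup_model_parallel() in this fork's example.py, but the exact code may differ): swap the NCCL backend for gloo and drop the CUDA device calls.

import os
import torch

def setup_model_parallel_cpu():
    # Hypothetical CPU variant of example.py's setup_model_parallel():
    # gloo is the CPU-capable process-group backend, and there is no
    # torch.cuda.set_device() call since no GPU is involved.
    torch.distributed.init_process_group("gloo")
    local_rank = int(os.environ.get("LOCAL_RANK", 0))
    world_size = torch.distributed.get_world_size()
    torch.manual_seed(1)  # identical seed in all processes
    return local_rank, world_size

# The weights would likewise need map_location="cpu" in torch.load() and no
# .cuda()/.half() casts; fp16 matmuls are not supported on most CPUs.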

Running a REST API instead of "example.py"

I see that "example.py" iteratively generates prompt answers on both machines (or instances, with layers partially loaded onto their GPUs). Is there any way to use multiple GPUs across multiple machines to deploy a REST API service, so that I can send prompts as requests from other services?
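
One possible shape for this, as a minimal sketch (hypothetical: fastapi is not part of this repo, and `generator` stands for the object example.py builds with load(); presumably only one instance in the Wrapyfi chain would expose the endpoint):

from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

# generator = load(...)  # as built in example.py's main()

class Prompt(BaseModel):
    text: str
    max_gen_len: int = 256

@app.post("/generate")
def generate(prompt: Prompt):
    # `generator` is assumed to be the LLaMA generator returned by load()
    # in example.py; its generate() takes a list of prompt strings.
    results = generator.generate(
        [prompt.text], max_gen_len=prompt.max_gen_len,
        temperature=0.8, top_p=0.95,
    )
    return {"completion": results[0]}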

CPU/memory requirements for worker nodes

This is exactly what I'm looking for to extend my existing cluster, which has high CPU/RAM and zero GPUs.
Can you give some insight into whether the workers can run on low-CPU/RAM systems, such as a series of Raspberry Pi 5s each with an RTX 4090 over a 1x PCIe link, while the master handles checkpoint reallocation using its high CPU/RAM capacity?
Also, is a gigabit cluster network sufficient to relay MQ messages between the workers?
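
On the gigabit question, a rough back-of-envelope under assumed numbers (the actual payloads depend on how this fork splits the model, so treat this as a sanity check, not a measurement):

# Assumed: one fp16 hidden-state tensor is relayed per generated token
# at each split point between workers.
hidden_dim = 4096            # LLaMA 7B hidden size
bytes_per_value = 2          # fp16
wps = 12                     # words/second, per the numbers reported above

bytes_per_token = hidden_dim * bytes_per_value   # ~8 KB
traffic = bytes_per_token * wps                  # ~96 KB/s per split point
gigabit_bytes = 125_000_000                      # 1 Gbit/s ~= 125 MB/s

print(f"{traffic / 1e3:.0f} KB/s needed vs ~{gigabit_bytes / 1e6:.0f} MB/s available")
# Bandwidth is ample; per-hop latency is the likelier bottleneck.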

Model Parallel Question

Did you change the model-parallel (MP) value for 7B? I think they used tensor parallelism, and the model may need to be modified so that MP matches the number of GPUs.
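
For context, the original LLaMA checkpoints are sharded by model-parallel size, and the loader asserts that the launched world size matches the shard count. A small sketch of that constraint (the mapping is from the upstream release; the assert mirrors the one in example.py's load()):

# Checkpoint shards (model-parallel size) per model in the original release.
MP_PER_MODEL = {"7B": 1, "13B": 2, "30B": 4, "65B": 8}

def check_world_size(model_size: str, world_size: int) -> None:
    # example.py's load() enforces this: the number of processes launched
    # via torchrun must equal the number of consolidated.*.pth shards.
    expected = MP_PER_MODEL[model_size]
    assert world_size == expected, (
        f"Loading a checkpoint for MP={expected} but world size is {world_size}"
    )

check_world_size("7B", 1)   # passes: 7B ships as a single shard (MP=1)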

Where are zeromq_proxy_broker.py and the standalone dir?

The examples in the README confused me. Where are the zeromq_proxy_broker.py file and the standalone dir?

Replace all occurrences of <YOUR_IP> and <YOUR_CHECKPOINT_DIRECTORY> before running the scripts.

Start the Wrapyfi ZeroMQ broker from within the Wrapyfi repo:

cd wrapyfi/standalone
python zeromq_proxy_broker.py --comm_type pubsubpoll
Start the first instance of the Wrapyfi-wrapped LLaMA from within this repo and env (order is important; don't start wrapyfi_device_idx=0 before wrapyfi_device_idx=1):
CUDA_VISIBLE_DEVICES="0" OMP_NUM_THREADS=1 torchrun --nproc_per_node 1 example.py --ckpt_dir <YOUR_CHECKPOINT_DIRECTORY>/checkpoints/7B --tokenizer_path <YOUR_CHECKPOINT_DIRECTORY>/checkpoints/tokenizer.model --wrapyfi_device_idx 1 --wrapyfi_total_devices 2
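
Then start the second instance, swapping the device index (a hedged completion based on the flags above; per the README setup, the second instance can also run on another machine):
CUDA_VISIBLE_DEVICES="1" OMP_NUM_THREADS=1 torchrun --nproc_per_node 1 example.py --ckpt_dir <YOUR_CHECKPOINT_DIRECTORY>/checkpoints/7B --tokenizer_path <YOUR_CHECKPOINT_DIRECTORY>/checkpoints/tokenizer.model --wrapyfi_device_idx 0 --wrapyfi_total_devices 2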

2 PCs with 4 GPUs each

I have 2 machines with 4 GPUs each.
On 192.168.2.14, I ran:
CUDA_VISIBLE_DEVICES="0" OMP_NUM_THREADS=1 WRAPYFI_ZEROMQ_SOCKET_IP='192.168.2.14' torchrun --nproc_per_node 4 example.py --ckpt_dir /storage/workplace/share/llama_models/65B --tokenizer_path ./tokenizer.model --wrapyfi_device_idx 1 --wrapyfi_total_devices 2
and it reported:

tensor([[[0, 1, 2, 3]]])
tensor([[[0, 1, 2, 3]]])
> initializing model parallel with size 4
> initializing ddp with size 1
> initializing pipeline with size 1
tensor([[[0, 1, 2, 3]]])
tensor([[[0, 1, 2, 3]]])
Traceback (most recent call last):
  File "example.py", line 125, in <module>
    fire.Fire(main)
  File "/home/winner/anaconda3/envs/py38_pt1110/lib/python3.8/site-packages/fire/core.py", line 141, in Fire
    component_trace = _Fire(component, args, parsed_flag_args, context, name)
  File "/home/winner/anaconda3/envs/py38_pt1110/lib/python3.8/site-packages/fire/core.py", line 475, in _Fire
    component, remaining_args = _CallAndUpdateTrace(
  File "/home/winner/anaconda3/envs/py38_pt1110/lib/python3.8/site-packages/fire/core.py", line 691, in _CallAndUpdateTrace
    component = fn(*varargs, **kwargs)
  File "example.py", line 78, in main
    local_rank, world_size = setup_model_parallel()
  File "example.py", line 25, in setup_model_parallel
    torch.cuda.set_device(local_rank)
  File "/home/winner/anaconda3/envs/py38_pt1110/lib/python3.8/site-packages/torch/cuda/__init__.py", line 313, in set_device
    torch._C._cuda_setDevice(device)
RuntimeError: CUDA error: invalid device ordinal
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.

[the same traceback is repeated, interleaved, by two more ranks]

Traceback (most recent call last):
  File "example.py", line 125, in <module>
    fire.Fire(main)
  File "/home/winner/anaconda3/envs/py38_pt1110/lib/python3.8/site-packages/fire/core.py", line 141, in Fire
    component_trace = _Fire(component, args, parsed_flag_args, context, name)
  File "/home/winner/anaconda3/envs/py38_pt1110/lib/python3.8/site-packages/fire/core.py", line 475, in _Fire
    component, remaining_args = _CallAndUpdateTrace(
  File "/home/winner/anaconda3/envs/py38_pt1110/lib/python3.8/site-packages/fire/core.py", line 691, in _CallAndUpdateTrace
    component = fn(*varargs, **kwargs)
  File "example.py", line 82, in main
    generator = load(
  File "example.py", line 44, in load
    assert world_size == len(
AssertionError: Loading a checkpoint for MP=8 but world size is 4
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 1426263) of binary: /home/winner/anaconda3/envs/py38_pt1110/bin/python
Traceback (most recent call last):
  File "/home/winner/anaconda3/envs/py38_pt1110/bin/torchrun", line 33, in <module>
    sys.exit(load_entry_point('torch==1.11.0', 'console_scripts', 'torchrun')())
  File "/home/winner/anaconda3/envs/py38_pt1110/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
    return f(*args, **kwargs)
  File "/home/winner/anaconda3/envs/py38_pt1110/lib/python3.8/site-packages/torch/distributed/run.py", line 724, in main
    run(args)
  File "/home/winner/anaconda3/envs/py38_pt1110/lib/python3.8/site-packages/torch/distributed/run.py", line 715, in run
    elastic_launch(
  File "/home/winner/anaconda3/envs/py38_pt1110/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 131, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/winner/anaconda3/envs/py38_pt1110/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 245, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
example.py FAILED
------------------------------------------------------------
Failures:
[1]:
  time      : 2023-03-13_11:50:12
  host      : localhost.localdomain
  rank      : 1 (local_rank: 1)
  exitcode  : 1 (pid: 1426264)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[2]:
  time      : 2023-03-13_11:50:12
  host      : localhost.localdomain
  rank      : 2 (local_rank: 2)
  exitcode  : 1 (pid: 1426265)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[3]:
  time      : 2023-03-13_11:50:12
  host      : localhost.localdomain
  rank      : 3 (local_rank: 3)
  exitcode  : 1 (pid: 1426266)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2023-03-13_11:50:12
  host      : localhost.localdomain
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 1426263)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
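
Two distinct problems show up in this log. First, CUDA_VISIBLE_DEVICES="0" exposes a single GPU while --nproc_per_node 4 launches four processes, so ranks 1-3 fail with "invalid device ordinal". Second, the 65B checkpoint is sharded for MP=8, and the assertion requires the torchrun world size to equal 8, which --nproc_per_node 4 cannot satisfy. A hedged sketch of the first fix (this resolves the ordinal error, but the MP=8 assertion will still fire unless the world size is 8 or the checkpoint is resharded):
CUDA_VISIBLE_DEVICES="0,1,2,3" OMP_NUM_THREADS=1 WRAPYFI_ZEROMQ_SOCKET_IP='192.168.2.14' torchrun --nproc_per_node 4 example.py --ckpt_dir /storage/workplace/share/llama_models/65B --tokenizer_path ./tokenizer.model --wrapyfi_device_idx 1 --wrapyfi_total_devices 2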

Error when running torch.load(ckpt_path, map_location="cpu")

Hi there,

I have downloaded the LLaMA models, but when I try to load the model, I get this error:
RuntimeError: PytorchStreamReader failed reading file data/2: invalid header or archive is corrupted

My PyTorch version is 1.13.1.
Has the model version been updated? My downloaded files look like this:
[screenshot: listing of the downloaded checkpoint files, 2023-05-03]
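
That RuntimeError typically points to a corrupted or incomplete download rather than a model-version change (an assumption about the cause, but a common one). The official download includes checklist.chk files, so the shards can be verified before loading:

cd <YOUR_CHECKPOINT_DIRECTORY>/checkpoints/7B   # or whichever model size you downloaded
md5sum -c checklist.chk                         # re-download any file that fails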
