modular-ml / wrapyfi-examples_llama

This project is a fork of meta-llama/llama: inference code for Facebook's LLaMA models with Wrapyfi support.

License: GNU General Public License v3.0
| GPU ID | Type | CPU Mem. | Power | GPU Mem. | WPS |
|---|---|---|---|---|---|
| 0 | TITAN Xp 12GB | 2.4 GB | 79 W / 250 W | 5.6 GB | 12 |
| 1 | TITAN Xp 12GB | 1.3 GB | 63 W / 250 W | 5.6 GB | 12 |
| 2 | TITAN X 12GB | 1.3 GB | 89 W / 250 W | 5.5 GB | 12 |
| 3 | TITAN X 12GB | 1.3 GB | 99 W / 250 W | 6.2 GB | 12 |
I've tried multi-GPU inference (4 GPUs here) with the 7B model, but I found that GPU utilization is much lower than with a single GPU. It seems Wrapyfi with LLaMA can't fully utilize the GPUs.
For the performance mentioned in the README, what is the total WPS: 12 or 4×12?
This isn't really an issue, but I'm trying to find a way to link multiple mobile/laptop devices together to essentially piggyback off each CPU. Is that doable with this fork? Any suggestions and tips would be welcome!
I see that "example.py" iteratively generates prompt answers on both machines (or instances with layers partially loaded onto their GPUs). Is there any way to utilize multiple GPUs on multiple machines to deploy a REST API service, so that I can send prompts as requests from other services?
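A minimal sketch of what such a service could look like, using only the Python standard library. Everything here is illustrative: `generate` is a stub standing in for a call into the rank-0 Wrapyfi-wrapped generator, and the endpoint shape is my own choice, not part of this repo.

```python
# Minimal HTTP wrapper around a (stubbed) text generator.
# Assumption: only rank 0 runs this server; the real `generate` would
# call into the Wrapyfi-wrapped LLaMA generator.
import json
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer


def generate(prompt: str) -> str:
    # Placeholder: replace with e.g. generator.generate([prompt], ...).
    return "echo: " + prompt


class PromptHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        prompt = json.loads(self.rfile.read(length))["prompt"]
        body = json.dumps({"completion": generate(prompt)}).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        # Silence per-request logging for this sketch.
        pass


def serve(port: int = 8080) -> HTTPServer:
    """Start the server on a background thread and return it."""
    server = HTTPServer(("127.0.0.1", port), PromptHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server
```

In practice you would also need to serialize requests (the generator is not reentrant) and have rank 0 broadcast each prompt to the other pipeline instances before generating.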
This is exactly what I'm looking for to extend my existing cluster, which has high CPU/RAM and zero GPUs.
Can you give some insight into whether the workers can run on low-CPU/RAM systems, such as a series of Raspberry Pi 5s each with an RTX 4090 over 1x PCIe, while the master handles checkpoint reallocation using its high CPU/RAM capacity?
Also, is a gigabit cluster network sufficient to relay the MQ messages between workers?
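For a rough sense of scale: splitting the model pipeline-style means shipping per-token activations between stages, which is tiny compared to gigabit bandwidth. A back-of-envelope sketch (the fp16 and zero-overhead assumptions are mine; 4096 is the 7B hidden dimension):

```python
# Back-of-envelope estimate of inter-stage traffic when a LLaMA-7B
# pipeline is split across machines. Assumptions (not measured):
# fp16 activations, hidden dim 4096, no serialization overhead.
HIDDEN_DIM = 4096
BYTES_PER_TOKEN = HIDDEN_DIM * 2      # 8 KiB of activations per token
GIGABIT_BYTES_PER_S = 1e9 / 8         # ~125 MB/s at best


def transfer_ms_per_token() -> float:
    """Wire time to ship one token's activations between stages."""
    return BYTES_PER_TOKEN / GIGABIT_BYTES_PER_S * 1000
```

Under these assumptions the wire time is well under a tenth of a millisecond per token, so at ~12 WPS the network is unlikely to be the bottleneck; latency and serialization overhead of the MQ layer matter more than raw bandwidth.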
Did you change the model-parallel (MP) value for 7B? I think they used tensor parallelism, so the model may need to be modified so that MP matches the number of GPUs.
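For reference, these are the checkpoint shard counts (MP sizes) in the original Meta release, which is what example.py's assertion effectively checks against the launched world size. The helper below is illustrative, not code from this repo:

```python
# Checkpoint shards (model-parallel size) per model in the original
# Meta release; torchrun's total world size must match this number.
MP_SIZE = {"7B": 1, "13B": 2, "30B": 4, "65B": 8}


def check_world_size(model: str, world_size: int) -> None:
    """Mirror the example.py check: shards on disk == processes launched."""
    expected = MP_SIZE[model]
    if world_size != expected:
        raise AssertionError(
            f"Loading a checkpoint for MP={expected} "
            f"but world size is {world_size}")
```

This is exactly the mismatch behind the "Loading a checkpoint for MP=8 but world size is 4" error further down: the 65B checkpoint has 8 shards, but only 4 processes were launched.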
The examples in the README confused me: where are the zeromq_proxy_broker.py file and the standalone directory?
Replace all occurrences of <YOUR_IP> and <YOUR_CHECKPOINT_DIRECTORY> before running the scripts.
Start the Wrapyfi ZeroMQ broker from within the Wrapyfi repo:
cd wrapyfi/standalone
python zeromq_proxy_broker.py --comm_type pubsubpoll
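Before starting the model instances, it can help to verify that the broker is actually reachable from each machine. This is a generic TCP probe, not part of Wrapyfi; substitute whichever host and port your ZeroMQ configuration uses:

```python
# Generic TCP reachability probe. Point it at the host/port your
# Wrapyfi ZeroMQ broker is configured to listen on.
import socket


def port_open(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to (host, port) succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```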
Start the first instance of the Wrapyfi-wrapped LLaMA from within this repo and environment (order is important; don't start wrapyfi_device_idx=0 before wrapyfi_device_idx=1):
CUDA_VISIBLE_DEVICES="0" OMP_NUM_THREADS=1 torchrun --nproc_per_node 1 example.py --ckpt_dir <YOUR_CHECKPOINT_DIRECTORY>/checkpoints/7B --tokenizer_path <YOUR_CHECKPOINT_DIRECTORY>/checkpoints/tokenizer.model --wrapyfi_device_idx 1 --wrapyfi_total_devices 2
I have 2 machines with 4 GPUs each.
On 192.168.2.14, I ran
CUDA_VISIBLE_DEVICES="0" OMP_NUM_THREADS=1 WRAPYFI_ZEROMQ_SOCKET_IP='192.168.2.14' torchrun --nproc_per_node 4 example.py --ckpt_dir /storage/workplace/share/llama_models/65B --tokenizer_path ./tokenizer.model --wrapyfi_device_idx 1 --wrapyfi_total_devices 2
and it reported:
tensor([[[0, 1, 2, 3]]])
tensor([[[0, 1, 2, 3]]])
> initializing model parallel with size 4
> initializing ddp with size 1
> initializing pipeline with size 1
tensor([[[0, 1, 2, 3]]])
tensor([[[0, 1, 2, 3]]])
Traceback (most recent call last):
File "example.py", line 125, in <module>
fire.Fire(main)
File "/home/winner/anaconda3/envs/py38_pt1110/lib/python3.8/site-packages/fire/core.py", line 141, in Fire
component_trace = _Fire(component, args, parsed_flag_args, context, name)
File "/home/winner/anaconda3/envs/py38_pt1110/lib/python3.8/site-packages/fire/core.py", line 475, in _Fire
component, remaining_args = _CallAndUpdateTrace(
File "/home/winner/anaconda3/envs/py38_pt1110/lib/python3.8/site-packages/fire/core.py", line 691, in _CallAndUpdateTrace
component = fn(*varargs, **kwargs)
File "example.py", line 78, in main
local_rank, world_size = setup_model_parallel()
File "example.py", line 25, in setup_model_parallel
torch.cuda.set_device(local_rank)
File "/home/winner/anaconda3/envs/py38_pt1110/lib/python3.8/site-packages/torch/cuda/__init__.py", line 313, in set_device
torch._C._cuda_setDevice(device)
RuntimeError: CUDA error: invalid device ordinal
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Traceback (most recent call last):
File "example.py", line 125, in <module>
fire.Fire(main)
File "/home/winner/anaconda3/envs/py38_pt1110/lib/python3.8/site-packages/fire/core.py", line 141, in Fire
component_trace = _Fire(component, args, parsed_flag_args, context, name)
File "/home/winner/anaconda3/envs/py38_pt1110/lib/python3.8/site-packages/fire/core.py", line 475, in _Fire
component, remaining_args = _CallAndUpdateTrace(
File "/home/winner/anaconda3/envs/py38_pt1110/lib/python3.8/site-packages/fire/core.py", line 691, in _CallAndUpdateTrace
component = fn(*varargs, **kwargs)
File "example.py", line 78, in main
local_rank, world_size = setup_model_parallel()
File "example.py", line 25, in setup_model_parallel
torch.cuda.set_device(local_rank)
File "/home/winner/anaconda3/envs/py38_pt1110/lib/python3.8/site-packages/torch/cuda/__init__.py", line 313, in set_device
torch._C._cuda_setDevice(device)
RuntimeError: CUDA error: invalid device ordinal
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Traceback (most recent call last):
File "example.py", line 125, in <module>
fire.Fire(main)
File "/home/winner/anaconda3/envs/py38_pt1110/lib/python3.8/site-packages/fire/core.py", line 141, in Fire
component_trace = _Fire(component, args, parsed_flag_args, context, name)
File "/home/winner/anaconda3/envs/py38_pt1110/lib/python3.8/site-packages/fire/core.py", line 475, in _Fire
component, remaining_args = _CallAndUpdateTrace(
File "/home/winner/anaconda3/envs/py38_pt1110/lib/python3.8/site-packages/fire/core.py", line 691, in _CallAndUpdateTrace
component = fn(*varargs, **kwargs)
File "example.py", line 78, in main
local_rank, world_size = setup_model_parallel()
File "example.py", line 25, in setup_model_parallel
torch.cuda.set_device(local_rank)
File "/home/winner/anaconda3/envs/py38_pt1110/lib/python3.8/site-packages/torch/cuda/__init__.py", line 313, in set_device
torch._C._cuda_setDevice(device)
RuntimeError: CUDA error: invalid device ordinal
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Traceback (most recent call last):
File "example.py", line 125, in <module>
fire.Fire(main)
File "/home/winner/anaconda3/envs/py38_pt1110/lib/python3.8/site-packages/fire/core.py", line 141, in Fire
component_trace = _Fire(component, args, parsed_flag_args, context, name)
File "/home/winner/anaconda3/envs/py38_pt1110/lib/python3.8/site-packages/fire/core.py", line 475, in _Fire
component, remaining_args = _CallAndUpdateTrace(
File "/home/winner/anaconda3/envs/py38_pt1110/lib/python3.8/site-packages/fire/core.py", line 691, in _CallAndUpdateTrace
component = fn(*varargs, **kwargs)
File "example.py", line 82, in main
generator = load(
File "example.py", line 44, in load
assert world_size == len(
AssertionError: Loading a checkpoint for MP=8 but world size is 4
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 1426263) of binary: /home/winner/anaconda3/envs/py38_pt1110/bin/python
Traceback (most recent call last):
File "/home/winner/anaconda3/envs/py38_pt1110/bin/torchrun", line 33, in <module>
sys.exit(load_entry_point('torch==1.11.0', 'console_scripts', 'torchrun')())
File "/home/winner/anaconda3/envs/py38_pt1110/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
return f(*args, **kwargs)
File "/home/winner/anaconda3/envs/py38_pt1110/lib/python3.8/site-packages/torch/distributed/run.py", line 724, in main
run(args)
File "/home/winner/anaconda3/envs/py38_pt1110/lib/python3.8/site-packages/torch/distributed/run.py", line 715, in run
elastic_launch(
File "/home/winner/anaconda3/envs/py38_pt1110/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 131, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/home/winner/anaconda3/envs/py38_pt1110/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 245, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
example.py FAILED
------------------------------------------------------------
Failures:
[1]:
time : 2023-03-13_11:50:12
host : localhost.localdomain
rank : 1 (local_rank: 1)
exitcode : 1 (pid: 1426264)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[2]:
time : 2023-03-13_11:50:12
host : localhost.localdomain
rank : 2 (local_rank: 2)
exitcode : 1 (pid: 1426265)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[3]:
time : 2023-03-13_11:50:12
host : localhost.localdomain
rank : 3 (local_rank: 3)
exitcode : 1 (pid: 1426266)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
time : 2023-03-13_11:50:12
host : localhost.localdomain
rank : 0 (local_rank: 0)
exitcode : 1 (pid: 1426263)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================