llm-efficiency-challenge / neurips_llm_efficiency_challenge
NeurIPS Large Language Model Efficiency Challenge: 1 LLM + 1 GPU + 1 Day
Hello, I have a question about that file.
Other than that, there seem to be a total of three files that can be used to run the HELM benchmark, and I wonder whether the final evaluation is done with that file or with "run_specs_full_coase_600_budge.conf".
Thank you.
Hey, I am trying to run the tutorial at https://github.com/llm-efficiency-challenge/neurips_llm_efficiency_challenge/tree/master/sample-submissions/lit-gpt
It doesn't say so explicitly, but if I want to make a Lit-GPT submission, do I have to cd into /neurips_llm_efficiency_challenge/sample-submissions/lit-gpt because that's where the Dockerfile is?
Now, when running the docker build step, I am getting the following error:
master ~/neurips_llm_efficiency_challenge/sample-submissions/lit-gpt docker build -t toy_submission .
[+] Building 2.5s (12/16) docker:default
=> [internal] load .dockerignore 0.0s
=> => transferring context: 2B 0.0s
=> [internal] load build definition from Dockerfile 0.0s
=> => transferring dockerfile: 1.52kB 0.0s
=> [internal] load metadata for ghcr.io/pytorch/pytorch-nightly:c69b6e5-cu11.8.0 0.1s
=> [ 1/12] FROM ghcr.io/pytorch/pytorch-nightly:c69b6e5-cu11.8.0@sha256:748628fda7661f7e0612299b2012c 0.0s
=> [internal] load build context 0.0s
=> => transferring context: 19.21kB 0.0s
=> CACHED [ 2/12] WORKDIR /submission 0.0s
=> CACHED [ 3/12] COPY /lit-gpt/ /submission/ 0.0s
=> CACHED [ 4/12] COPY ./fast_api_requirements.txt fast_api_requirements.txt 0.0s
=> CACHED [ 5/12] RUN pip install --no-cache-dir --upgrade -r fast_api_requirements.txt 0.0s
=> CACHED [ 6/12] RUN apt-get update && apt-get install -y git 0.0s
=> CACHED [ 7/12] RUN pip install -r requirements.txt huggingface_hub sentencepiece 0.0s
=> ERROR [ 8/12] RUN python scripts/download.py --repo_id openlm-research/open_llama_3b 2.3s
Trying the last step manually, it gives me an error that the scripts/download.py file doesn't exist. Shouldn't it be lit-gpt/scripts/download.py instead?
Following the README, I tried building the Docker image, but I get the following error in the weights-conversion step.
=> ERROR [10/13] RUN python scripts/convert_hf_checkpoint.py --checkpoint_dir checkpoints/open-llama/7B --model_size 7B 66.8s
------
> [10/13] RUN python scripts/convert_hf_checkpoint.py --checkpoint_dir checkpoints/open-llama/7B --model_size 7B:
65.78 Initializing lit-llama
65.78 Saving to disk at checkpoints/lit-llama/7B
65.78 Processing checkpoints/open-llama/7B/pytorch_model-00001-of-00002.bin
65.78 Traceback (most recent call last):
65.78 File "/submission/scripts/convert_hf_checkpoint.py", line 136, in convert_hf_checkpoint
65.78 sd[sd_key] = saver.store_early(sd[sd_key])
65.78 File "/submission/lit_llama/utils.py", line 469, in store_early
65.78 return SavingProxyForTensor(tensor, self)
65.78 File "/submission/lit_llama/utils.py", line 387, in __init__
65.78 storage_proxy = SavingProxyForStorage(
65.78 File "/submission/lit_llama/utils.py", line 363, in __init__
65.78     storage_key = saver._write_storage_and_return_key(storage)
65.78   File "/submission/lit_llama/utils.py", line 492, in _write_storage_and_return_key
65.79 self.zipfile.write_record(name, storage.data_ptr(), num_bytes)
65.79 RuntimeError: [enforce fail at inline_container.cc:471] . PytorchStreamWriter failed writing file data/151: file write failed
65.79
65.79 During handling of the above exception, another exception occurred:
65.79
65.79 Traceback (most recent call last):
65.79 File "/submission/scripts/convert_hf_checkpoint.py", line 166, in <module>
65.79 CLI(convert_hf_checkpoint)
65.79 File "/opt/conda/lib/python3.10/site-packages/jsonargparse/_cli.py", line 85, in CLI
65.79 return _run_component(component, cfg_init)
65.79 File "/opt/conda/lib/python3.10/site-packages/jsonargparse/_cli.py", line 147, in _run_component
65.79 return component(**cfg)
65.79 File "/opt/conda/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
65.79 return func(*args, **kwargs)
65.79 File "/submission/scripts/convert_hf_checkpoint.py", line 88, in convert_hf_checkpoint
65.79 with incremental_save(output_dir / "lit-llama.pth") as saver:
65.79 File "/submission/lit_llama/utils.py", line 496, in __exit__
65.79 self.zipfile.write_end_of_file()
65.79 RuntimeError: [enforce fail at inline_container.cc:337] . unexpected pos 17973300160 vs 17973300056
------
Dockerfile:22
--------------------
20 | # get open-llama weights
21 | RUN python scripts/download.py --repo_id openlm-research/open_llama_7b --local_dir checkpoints/open-llama/7B
22 | >>> RUN python scripts/convert_hf_checkpoint.py --checkpoint_dir checkpoints/open-llama/7B --model_size 7B
23 |
24 | # Copy over single file server
--------------------
ERROR: failed to solve: process "/bin/sh -c python scripts/convert_hf_checkpoint.py --checkpoint_dir checkpoints/open-llama/7B --model_size 7B" did not complete successfully: exit code: 1
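For what it's worth, the two RuntimeErrors from PytorchStreamWriter above ("file write failed" and "unexpected pos") are commonly symptoms of the disk filling up while the converted lit-llama.pth checkpoint is being written. As a minimal sketch (the path and the size threshold below are assumptions, not from the repo), you can check free space on the host before rebuilding:

```python
import shutil

# Check free space on the filesystem Docker builds on (path is an
# assumption; on many Linux hosts Docker data lives under /var/lib/docker).
total, used, free = shutil.disk_usage("/")
free_gb = free / 1e9
print(f"free: {free_gb:.1f} GB")

# A 7B-parameter fp32 checkpoint is roughly 28 GB on disk (illustrative
# figure), so the conversion step needs at least that much headroom.
enough = free_gb > 30
```

If `enough` is False, pruning old images (`docker system prune`) or moving the Docker data root to a larger disk usually resolves this class of failure.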
$ docker build -t toy_submission .
[+] Building 0.1s (2/2) FINISHED docker:default
=> [internal] load build definition from Dockerfile 0.0s
=> => transferring dockerfile: 2B 0.0s
=> [internal] load .dockerignore 0.0s
=> => transferring context: 2B 0.0s
ERROR: failed to solve: failed to read dockerfile: open /var/lib/docker/tmp/buildkit-mount3987352626/Dockerfile: no such file or directory
cc: @carmocca
Hi!
I just wanted to point out that the NeurIPS Competitions blog post still links to https://weiweiy.github.io/, which forwards to https://kaleab-k.github.io/LLMEC/index instead of https://llm-efficiency-challenge.github.io/.
Best,
Harald
I wonder if you might consider getting rid of the conlang_translation task. Doing multilingual NLP is a whole different thing, and it's useful, but if everything else is English, then having one translation task seems like a distraction.
...And maybe language_identification. That's already so easy and effective with fasttext-style models anyway that it seems like a low-priority thing for people to be working on with LLMs.
AFAICT removing these two would mean people can focus on English tasks. There's already a lot to do, so this would make life easier!
cc @weiweiy, @artidoro and @perlitz as requested by @msaroufim
I got an error in this code block:

from quantize.bnb import Linear4bit

class QuantizedLinear(Linear4bit):
    def __init__(self, *args, **kwargs):
        super().__init__(*args, quant_type="nf4", compress_statistics=True, **kwargs)

How can I fix it?
Ref: https://github.com/Lightning-AI/lit-llama
The open-source code in this repository works with the original LLaMA weights that are distributed by Meta under a research-only license.
New Apache 2.0 licensed weights are being released as part of the Open LLaMA project. To use the Open LLaMA weights or other LLaMA-like checkpoints such as Vicuna, check out the Lit-GPT repository.
In the toy submission docs here, I think you'd also need to first install the dependencies, e.g., via pip install -r requirements.txt from the lit-gpt subfolder, before running docker build -t toy_submission .
Or perhaps this should be added as a step in the Dockerfile (but I don't know enough about Docker).
Could we allow users to be able to choose the batch size when doing evaluation? It is important to be able to change batch size when experimenting with different sized models on GPU. @drisspg
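As a hedged illustration of what a configurable batch size amounts to (the function below is hypothetical, not part of the challenge harness): the evaluation prompts just need to be chunked before each forward pass.

```python
def batched(items, batch_size):
    """Yield successive chunks of at most batch_size items."""
    if batch_size < 1:
        raise ValueError("batch_size must be >= 1")
    for i in range(0, len(items), batch_size):
        yield items[i:i + batch_size]

# Example: 5 prompts with batch_size=2 means 3 forward passes
# instead of 5, trading memory for throughput.
prompts = [f"prompt {i}" for i in range(5)]
batches = list(batched(prompts, batch_size=2))
```

Exposing batch_size as a user-facing knob would let people tune this trade-off per model size and GPU.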
In the line here, the lengths of the iterables passed to zip do not match. More specifically, tokens includes all of the tokens (i.e., input prompt + model response), whereas logprobs corresponds only to the newly generated tokens (i.e., the model response). As a result, I believe the generated_tokens variable has incorrect values stored in it.
Can someone look into this / confirm it?
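To make the mismatch concrete, here is a minimal sketch (the token values and prompt length are made up; only the zip behavior matters). Python's zip silently truncates to the shortest iterable, so the leading prompt tokens get paired with the response-only logprobs:

```python
tokens = ["The", "cat", "sat", "on", "a", "mat"]   # prompt + generated response
logprobs = [-0.5, -0.2]                            # generated tokens only

# zip() stops at the shorter iterable, so the FIRST two entries of
# tokens (prompt tokens) are paired with the response logprobs.
generated_tokens = list(zip(tokens, logprobs))
# -> [("The", -0.5), ("cat", -0.2)]  -- misaligned

# One possible fix: skip the prompt tokens before zipping.
num_prompt_tokens = 4
aligned = list(zip(tokens[num_prompt_tokens:], logprobs))
# -> [("a", -0.5), ("mat", -0.2)]
```

With `strict=True` (Python 3.10+), `zip` would raise instead of silently truncating, which would have surfaced this bug immediately.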
ubuntu@146-235-201-180:~/neurips_llm_efficiency_challenge/sample-submissions/lit-gpt$ sudo docker build -t toy_submission .
[+] Building 0.5s (7/16)
=> [internal] load build definition from Dockerfile 0.0s
=> => transferring dockerfile: 1.38kB 0.0s
=> [internal] load .dockerignore 0.0s
=> => transferring context: 2B 0.0s
=> [internal] load metadata for ghcr.io/pytorch/pytorch-nightly:c69b6e5-cu11.8.0 0.5s
=> [internal] load build context 0.0s
=> => transferring context: 128B 0.0s
=> [ 1/12] FROM ghcr.io/pytorch/pytorch-nightly:c69b6e5-cu11.8.0@sha256:748628fda7661f7e0612299b2012ca3a9407ac920ea791398f9d553de8a43380 0.0s
=> CACHED [ 2/12] WORKDIR /submission 0.0s
=> ERROR [ 3/12] COPY /lit-gpt/ /submission/ 0.0s
------
> [ 3/12] COPY /lit-gpt/ /submission/:
------
Dockerfile:10
--------------------
8 |
9 | # Copy the specific file into the container at /submission
10 | >>> COPY /lit-gpt/ /submission/
11 |
12 | # Setup server requriements
--------------------
ERROR: failed to solve: failed to compute cache key: failed to calculate checksum of ref moby::de7206o7kdfxfqkatsgfjhmw2: "/lit-gpt": not found
I followed all the instructions in README.md, but I got the above error.
Does the Dockerfile need to be modified?