Random thoughts on the new docker stuff about levanter HOT 15 CLOSED

dlwh commented on August 15, 2024

Random thoughts on the new docker stuff

from levanter.

Comments (15)

dlwh commented on August 15, 2024

(cc @rjpower I'll keep editing as i figure things out etc.)

rjpower commented on August 15, 2024

Good ideas. Yeah I think docker just wants a pty for the statusbar to show up. We can do something with pty.spawn instead of subprocess.check_output to make that better.
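
A minimal sketch of the pty.spawn idea, assuming a POSIX host. The helper name run_with_pty is hypothetical and this is not the launch script's actual code. It runs the command attached to a pseudo-terminal, so tools that only draw progress output on a tty behave as they would interactively, while still recording what was printed:

```python
import os
import pty


def run_with_pty(argv):
    """Run argv attached to a pseudo-terminal; return (exit_code, output_bytes)."""
    chunks = []

    def read_and_record(fd):
        # pty.spawn calls this whenever the child has written to the pty.
        data = os.read(fd, 1024)
        chunks.append(data)
        return data  # the returned bytes are echoed to our own stdout

    # pty.spawn returns the raw wait status of the child process.
    status = pty.spawn(argv, read_and_record)
    return os.waitstatus_to_exitcode(status), b"".join(chunks)
```

Unlike subprocess.check_output, this streams output as it arrives rather than only returning it at the end. Note that a pty translates newlines to "\r\n" in the captured bytes.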

Eyeballing it now, I'd say about half the time is redundant work we do now to set up the repo and VM that we can skip (or just merge some commands together). I thought about doing this before but was running into issues where you change things a little bit and now something is stale. I can probably just add a --force_init_vm flag to help with this.

I agree, I'm a bit scared of adding the "--no_push" flag, but we can definitely do it if needed...

rjpower commented on August 15, 2024

Sent #628 which helps a bit (and adds status!)

Re: not exiting all the way -- yeah, I've had some issues when I try to Ctrl-C out of a run, things don't close down fully. I think the google ssh just interrupts one of the commands? Not sure how to best handle that; we can trap the interrupt and then send a docker stop (or docker kill) to force everything closed.
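
One way the "trap the interrupt" idea could look, as a rough sketch only. The function name run_with_cleanup and the shape of cleanup_cmd are assumptions, and the real launcher would have to run the cleanup on the remote VM over ssh:

```python
import subprocess


def run_with_cleanup(cmd, cleanup_cmd):
    """Run cmd; on Ctrl-C, run cleanup_cmd before re-raising the interrupt."""
    proc = subprocess.Popen(cmd)
    try:
        return proc.wait()
    except KeyboardInterrupt:
        # Best-effort cleanup, e.g. ["docker", "stop", "<container>"].
        subprocess.run(cleanup_cmd, check=False)
        # Make sure the local child is gone as well.
        proc.terminate()
        proc.wait()
        raise
```

Re-raising keeps the normal Ctrl-C behavior for the caller. The cleanup step just makes sure the container is told to stop first.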

For the push slowness, can you double-check that nothing extra is being added to the build context? I found ncdu -X .dockerignore is pretty accurate for that. I found I had stale cache files etc. that I was sending by accident, and it would really blow things up. You pay the cost for the whole source directory every time, even if you have some big files that don't change. (If there are some big files we want to put in Docker, we just need to add separate COPY lines for them.)

rjpower commented on August 15, 2024

I have no attachment to the time for the run name! Happy to use whatever you suggest instead.

Are you thinking we'd run wandb.init from the launch script and then thread WANDB_NAME and WANDB_RUN_ID through to the container? Or are you just thinking we should have a different way to make a unique run_id value?

dlwh commented on August 15, 2024

no, i'm not sure what's going on. The old scripts would automatically set the runid to a "wandb-esque" 8-letter alphabetic sequence, and the run name would still default to one of the pretty "noble-capybara-86" things that wandb assigned server-side. (but you could override in config)

Somehow now the run name and the run id are getting set to the time. Run id being set to the time seems fine, but i find it easier to at least temporarily remember what distinguishes noble-capybara from morning-river when I'm iterating quickly

dlwh commented on August 15, 2024

(thanks for the PR! will look, but might be tomorrow night)

rjpower commented on August 15, 2024

Sure, no hurry.

Now that you point it out... I can't figure out what's going on with wandb either. I tried changing the run-id to some alphanumeric stuff but it keeps insisting on using the id as the name. Moving the wandb initialization to launch.py doesn't seem like a bad idea (like we could use env var instead of the jax broadcast if desired), but it kind of bakes it in as a tracker and I'd like to avoid doing it just because I can't figure out how wandb works...

I printed out the init args and env vars in case something stands out:

GIT_COMMIT=76232731e71fa3f6d47761f5e923471dcfb45d85
GPG_KEY=A035C8C19219BA821ECEA86B64E628F8D684696D
GRPC_VERBOSITY=ERROR
HF_TOKEN=
HOME=/home/levanter
HOSTNAME=t1v-n-b6c32295-w-0
JAX_PLATFORMS=tpu,cpu
LANG=C.UTF-8
LIBTPU_INIT_ARGS=--xla_tpu_impure_oom_fast_exit_threshold=-1
PATH=/opt/levanter/.venv/bin:/usr/local/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
PYTHONPATH=/opt/levanter:/opt/levanter/src:/opt/levanter/examples:/opt/levanter/tests
PYTHON_GET_PIP_SHA256=dfe9fd5c28dc98b5ac17979a953ea550cec37ae1b47a5116007395bfacff2ab9
PYTHON_GET_PIP_URL=https://github.com/pypa/get-pip/raw/dbf0c85f76fb6e1ab42aa672ffca6f0a675d9ee4/public/get-pip.py
PYTHON_PIP_VERSION=23.0.1
PYTHON_SETUPTOOLS_VERSION=65.5.1
PYTHON_VERSION=3.10.14
RAY_CLIENT_MODE=0
RAY_USAGE_STATS_ENABLED=0
RUN_ID=abcde
TENSORSTORE_CURL_LOW_SPEED_LIMIT_BYTES=1024
TENSORSTORE_CURL_LOW_SPEED_TIME_SECONDS=60
TERM=xterm
TF_CPP_MIN_LOG_LEVEL=1
TPU_MIN_LOG_LEVEL=0
TPU_ML_PLATFORM=JAX
TPU_STDERR_LOG_LEVEL=0
TRANSFORMERS_VERBOSITY=error
WANDB_API_KEY=
WANDB_DOCKER=levanter-power
WANDB_ENTITY=wasabi-labs
WANDB_PROJECT=levanter

WANDB init  entity= None  project= levanter  name= None  tags= ['openwebtext', 'llama']  id= abcde  group= None  resume= allow  mode= disabled  config= {'git_commit': '76232731e71fa3f6d47761f5e923471dcfb45d85'}

rjpower commented on August 15, 2024

This seems very erratic. I can reproduce the weird behavior without Docker:

RUN_ID=3abcde123 python src/levanter/main/train_lm.py --config=config/gpt2_nano.yaml --data.cache_dir=gs://wasabi-tpu-training/gpt2-nano/

will sometimes give a nice name, but usually seems to be using the id as the name. I also tried investigating wandb login vs env vars but that doesn't seem to change things. I can just have launch.py create a nice name and feed it to WANDB_NAME if needed.
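
A sketch of what "create a nice name and feed it to WANDB_NAME" might look like on the launcher side. The word lists and both function names are made up for illustration. WANDB_NAME is the environment variable wandb reads for the run name, and -e KEY=VALUE is how docker run passes environment variables:

```python
import secrets

# Hypothetical word lists; any vocabulary works.
ADJECTIVES = ("noble", "morning", "quiet", "brisk")
NOUNS = ("capybara", "river", "meadow", "falcon")


def make_run_name():
    """Return a memorable name in the style of 'noble-capybara-86'."""
    adjective = secrets.choice(ADJECTIVES)
    noun = secrets.choice(NOUNS)
    return f"{adjective}-{noun}-{secrets.randbelow(100)}"


def docker_env_flags(env):
    """Turn a dict into repeated `-e KEY=VALUE` arguments for `docker run`."""
    flags = []
    for key, value in env.items():
        flags += ["-e", f"{key}={value}"]
    return flags


env = {"WANDB_NAME": make_run_name(), "RUN_ID": "3abcde12"}
print(docker_env_flags(env))
```

This keeps the tracker choice out of the launcher's logic: it only sets environment variables, and whichever tracker runs inside the container decides whether to use them.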

dlwh commented on August 15, 2024

that's bizarre. maybe something is going on with wandb right now? but the timing is suspicious

dlwh commented on August 15, 2024

oh the wandb mode you logged is disabled (which is by design if you're not grabbing jax.process_index() == 0)
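
For readers following along, the pattern being described is roughly the following. This is an illustrative sketch, not Levanter's actual code: wandb_mode_for is a hypothetical helper, and the process index would come from jax.process_index() in a real run.

```python
def wandb_mode_for(process_index, requested_mode="online"):
    """Only the first process logs to wandb; all other workers are disabled."""
    return requested_mode if process_index == 0 else "disabled"


# A log line captured on worker 3 would therefore show mode= disabled.
print(wandb_mode_for(0), wandb_mode_for(3))
```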

rjpower commented on August 15, 2024

> oh the wandb mode you logged is disabled (which is by design if you're not grabbing jax.process_index() == 0)

Ah, yeah I just grabbed a random worker.

> that's bizarre. maybe something is going on with wandb right now? but the timing is suspicious

Yeah, I can't figure it out. I've been tracing the flow through wandb_init.py. The display name seems to be generated by their server. If I don't assign a run id, the client will generate one, and then the server will return a nice display name. If I give my own unique name, it blows up. If I copy-and-paste the wandb name and change a character, it's okay.

Maybe it's checking that the name has some specific format (exactly 8 characters? anything else?)

dlwh commented on August 15, 2024

Oops, alphanumeric base-36. This is wandb's reference implementation in their SDK

import secrets
import string


def generate_id(length: int = 8) -> str:
    """Generate a random base-36 string of `length` digits."""
    # There are ~2.8T base-36 8-digit strings. If we generate 210k ids,
    # we'll have a ~1% chance of collision.
    alphabet = string.ascii_lowercase + string.digits
    return "".join(secrets.choice(alphabet) for _ in range(length))
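
The collision figure in that comment can be sanity-checked with the standard birthday approximation, p ≈ 1 - exp(-n(n-1)/(2N)). This is a quick check added here, not part of the wandb SDK, and collision_probability is a made-up name:

```python
import math


def collision_probability(num_ids, space_size):
    """Birthday-bound approximation for at least one collision among num_ids draws."""
    return 1.0 - math.exp(-num_ids * (num_ids - 1) / (2.0 * space_size))


space = 36 ** 8  # number of 8-character base-36 strings, about 2.8e12
p = collision_probability(210_000, space)
print(f"{space:.3e} ids, collision probability {p:.4f}")
```

The result comes out a little under 1%, which is consistent with the "~1%" quoted in the comment.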

rjpower commented on August 15, 2024

How bizarre. Yeah, I changed the generation to make 8 character base-32 run ids and it works (but only if they're lowercase!)
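
A small sketch of a generator plus a format check based on what this thread observed. The rule "exactly 8 lowercase alphanumeric characters" is inferred from the experiments above, not from documented wandb behavior, and both function names are hypothetical:

```python
import re
import secrets
import string

# Assumption from the thread: only ids of this shape get a server-side display name.
_ID_PATTERN = re.compile(r"[a-z0-9]{8}")


def generate_run_id(length=8):
    """Random lowercase base-36 id, mirroring the SDK snippet quoted above."""
    alphabet = string.ascii_lowercase + string.digits
    return "".join(secrets.choice(alphabet) for _ in range(length))


def looks_like_wandb_id(run_id):
    """True if run_id is exactly 8 lowercase alphanumeric characters."""
    return _ID_PATTERN.fullmatch(run_id) is not None


print(looks_like_wandb_id(generate_run_id()))  # True
print(looks_like_wandb_id("3abcde123"))        # False: nine characters
```

A launcher could warn when a user-supplied RUN_ID fails this check, since such ids appear to end up being used as the run name too.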

dlwh commented on August 15, 2024

lol amazing

dlwh commented on August 15, 2024

i have more thoughts but these all seem fixed!
