dlwh commented on August 15, 2024

(cc @rjpower I'll keep editing as i figure things out etc.)

from levanter.

rjpower commented on August 15, 2024

Good ideas. Yeah I think docker just wants a pty for the statusbar to show up. We can do something with pty.spawn instead of subprocess.check_output to make that better.
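A minimal sketch of the pty idea, assuming the only goal is to make the child process believe its output is a terminal. It uses pty.openpty with subprocess.Popen rather than pty.spawn so that the output can also be captured; that is a swap made for this sketch, and run_in_pty is a hypothetical helper, not something in the launch script. Unix only.

```python
import os
import pty
import subprocess
import sys


def run_in_pty(argv):
    """Run argv with stdout/stderr attached to a pseudo-terminal.

    Returns (exit_code, captured_bytes). Hypothetical helper for illustration.
    """
    master_fd, slave_fd = pty.openpty()
    try:
        proc = subprocess.Popen(
            argv, stdin=subprocess.DEVNULL, stdout=slave_fd, stderr=slave_fd
        )
    except BaseException:
        os.close(master_fd)
        raise
    finally:
        # The parent never writes to the slave side; the child holds its own copy.
        os.close(slave_fd)

    chunks = []
    try:
        while True:
            try:
                data = os.read(master_fd, 4096)
            except OSError:
                # Linux reports EIO once every copy of the slave side is closed.
                break
            if not data:
                break
            chunks.append(data)
    finally:
        os.close(master_fd)
    return proc.wait(), b"".join(chunks)


if __name__ == "__main__":
    # The child reports whether it thinks stdout is a terminal.
    code, out = run_in_pty(
        [sys.executable, "-c", "import sys; print(sys.stdout.isatty())"]
    )
    print(code, out)
```

With subprocess.check_output the child sees a pipe, so isatty() is False there; whether that is the exact check docker uses for its progress display is an assumption.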

Eyeballing it now, I'd say about half the time is redundant work we do to set up the repo and VM that we can skip (or just merge some commands together). I thought about doing this before but was running into issues where you change things a little bit and now something is stale. I can probably just add a --force_init_vm flag to help with this.

I agree, I'm a bit scared of adding the "--no_push" flag, but we can definitely do it if needed...

rjpower commented on August 15, 2024

Sent #628 which helps a bit (and adds status!)

Re: not exiting all the way -- yeah, I've had some issues when I try to Ctrl-C out of a run, things don't close down fully. I think the google ssh just interrupts one of the commands? Not sure how to best handle that; we can trap the interrupt and then send a docker stop --kill to force everything closed.
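The "trap the interrupt, then force everything closed" idea could look roughly like this. It is a sketch under assumptions: run_with_cleanup is a hypothetical helper, and the cleanup command is left to the caller because the thread does not say how the container is named or reached.

```python
import subprocess
import sys


def run_with_cleanup(command, cleanup_command):
    """Run command; if the user hits Ctrl-C, run cleanup_command, then re-raise.

    Both arguments are argv lists. Returns the exit code of command.
    """
    try:
        return subprocess.run(command, check=False).returncode
    except KeyboardInterrupt:
        # Best-effort cleanup: ignore its exit code so the interrupt still propagates.
        subprocess.run(cleanup_command, check=False)
        raise


if __name__ == "__main__":
    # Harmless stand-ins; a real launcher would pass its ssh/docker commands here,
    # with something like ["docker", "kill", "<container>"] as the cleanup step.
    noop = [sys.executable, "-c", "pass"]
    print(run_with_cleanup(noop, noop))
```

Note that docker kill and docker stop are separate commands; which one fits, and whether it has to be sent over ssh to the VM, depends on the launch setup.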

For the push slowness, can you double check you don't have other stuff that's being added to the context? I found ncdu -X .dockerignore is pretty accurate for that. I found I would have stale cache files etc. that I was sending by accident and it would really blow things up. You pay the cost for the whole source directory every time, even if you have some big files that don't change. (If there are some big files we want to put in Docker, we just need to add separate COPY lines for them.)
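If ncdu is not handy, a rough Python approximation of the same check is sketched below. The matching uses fnmatch, which is looser than Docker's real .dockerignore rules (no "!" negation, and "*" also crosses directory separators), so treat the numbers as an estimate. The function names are made up for this example.

```python
import fnmatch
import os


def is_ignored(rel_path, patterns):
    """Approximate check: match each leading part of rel_path against the patterns."""
    parts = rel_path.split(os.sep)
    prefixes = ["/".join(parts[: i + 1]) for i in range(len(parts))]
    return any(fnmatch.fnmatch(p, pat) for p in prefixes for pat in patterns)


def context_sizes(root, patterns=()):
    """Return {top-level entry: total bytes} for files that are not ignored."""
    sizes = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            rel = os.path.relpath(full, root)
            if is_ignored(rel, patterns):
                continue
            top = rel.split(os.sep)[0]
            sizes[top] = sizes.get(top, 0) + os.path.getsize(full)
    return sizes


if __name__ == "__main__":
    # Print the largest top-level entries of the current directory.
    totals = context_sizes(".", patterns=[".git", "*.pyc"])
    for entry, size in sorted(totals.items(), key=lambda kv: -kv[1])[:10]:
        print(f"{size:>12} {entry}")
```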

rjpower commented on August 15, 2024

I have no attachment to the time for the run name! Happy to use whatever you suggest instead.

Are you thinking we'd run wandb.init from the launch script and then thread WANDB_NAME and WANDB_RUN_ID through to the container? Or are you just thinking we should have a different way to make a unique run_id value?

dlwh commented on August 15, 2024

no, i'm not sure what's going on. The old scripts would automatically set the runid to a "wandb-esque" 8-letter alphabetic sequence, and the run name would still default to one of the pretty "noble-capybara-86" things that wandb assigned server-side. (but you could override in config)

Somehow now the run name and the run id are getting set to the time. Run id being set to the time seems fine, but i find it easier to at least temporarily remember what distinguishes noble-capybara from morning-river when I'm iterating quickly

dlwh commented on August 15, 2024

(thanks for the PR! will look, but might be tomorrow night)

rjpower commented on August 15, 2024

Sure, no hurry.

Now that you point it out... I can't figure out what's going on with wandb either. I tried changing the run-id to some alphanumeric stuff but it keeps insisting on using the id as the name. Moving the wandb initialization to launch.py doesn't seem like a bad idea (like we could use env var instead of the jax broadcast if desired), but it kind of bakes it in as a tracker and I'd like to avoid doing it just because I can't figure out how wandb works...

I printed out the init args and env vars in case something stands out:

GIT_COMMIT=76232731e71fa3f6d47761f5e923471dcfb45d85
GPG_KEY=A035C8C19219BA821ECEA86B64E628F8D684696D
GRPC_VERBOSITY=ERROR
HF_TOKEN=
HOME=/home/levanter
HOSTNAME=t1v-n-b6c32295-w-0
JAX_PLATFORMS=tpu,cpu
LANG=C.UTF-8
LIBTPU_INIT_ARGS=--xla_tpu_impure_oom_fast_exit_threshold=-1
PATH=/opt/levanter/.venv/bin:/usr/local/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
PYTHONPATH=/opt/levanter:/opt/levanter/src:/opt/levanter/examples:/opt/levanter/tests
PYTHON_GET_PIP_SHA256=dfe9fd5c28dc98b5ac17979a953ea550cec37ae1b47a5116007395bfacff2ab9
PYTHON_GET_PIP_URL=https://github.com/pypa/get-pip/raw/dbf0c85f76fb6e1ab42aa672ffca6f0a675d9ee4/public/get-pip.py
PYTHON_PIP_VERSION=23.0.1
PYTHON_SETUPTOOLS_VERSION=65.5.1
PYTHON_VERSION=3.10.14
RAY_CLIENT_MODE=0
RAY_USAGE_STATS_ENABLED=0
RUN_ID=abcde
TENSORSTORE_CURL_LOW_SPEED_LIMIT_BYTES=1024
TENSORSTORE_CURL_LOW_SPEED_TIME_SECONDS=60
TERM=xterm
TF_CPP_MIN_LOG_LEVEL=1
TPU_MIN_LOG_LEVEL=0
TPU_ML_PLATFORM=JAX
TPU_STDERR_LOG_LEVEL=0
TRANSFORMERS_VERBOSITY=error
WANDB_API_KEY=
WANDB_DOCKER=levanter-power
WANDB_ENTITY=wasabi-labs
WANDB_PROJECT=levanter
WANDB init  entity= None  project= levanter  name= None  tags= ['openwebtext', 'llama']  id= abcde  group= None  resume= allow  mode= disabled  config= {'git_commit': '76232731e71fa3f6d47761f5e923471dcfb45d85'}

rjpower commented on August 15, 2024

This seems very erratic. I can reproduce the weird behavior without Docker:

RUN_ID=3abcde123 python src/levanter/main/train_lm.py --config=config/gpt2_nano.yaml --data.cache_dir=gs://wasabi-tpu-training/gpt2-nano/

will sometimes give a nice name, but usually seems to be using the id as the name. I also tried investigating wandb login vs env vars but that doesn't seem to change things. I can just have launch.py create a nice name and feed it to WANDB_NAME if needed.
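A sketch of what "create a nice name and feed it to WANDB_NAME" might look like on the launcher side. WANDB_NAME and WANDB_RUN_ID are real wandb environment variables; the word lists, make_run_name, and container_env are invented for illustration and are not part of launch.py.

```python
import secrets

# Tiny illustrative word lists; a real launcher would want many more entries.
ADJECTIVES = ["noble", "morning", "quiet", "bright"]
NOUNS = ["capybara", "river", "meadow", "comet"]


def make_run_name():
    """Build a wandb-style display name such as 'noble-capybara-86'."""
    number = secrets.randbelow(100) + 1
    return f"{secrets.choice(ADJECTIVES)}-{secrets.choice(NOUNS)}-{number}"


def container_env(run_id, base_env=None):
    """Return a copy of base_env with the wandb id and display name set."""
    env = dict(base_env or {})
    env["WANDB_RUN_ID"] = run_id
    env["WANDB_NAME"] = make_run_name()
    return env


if __name__ == "__main__":
    print(container_env("ab12cd34", {"WANDB_PROJECT": "levanter"}))
```

The catch is that a locally generated name will not follow wandb's own server-side numbering, so it only stands in for the pretty names rather than reproducing them.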

dlwh commented on August 15, 2024

that's bizarre. maybe something is going on with wandb right now? but the timing is suspicious

dlwh commented on August 15, 2024

oh the wandb mode you logged is disabled (which is by design if you're not grabbing jax.process_index() == 0)
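The rule in question, sketched as a standalone function. The name wandb_mode_for and the "online" default are assumptions; Levanter's actual tracker code is not shown in this thread. In a real run the index would come from jax.process_index().

```python
def wandb_mode_for(process_index, requested_mode="online"):
    """Only process 0 reports to wandb; every other worker runs disabled."""
    return requested_mode if process_index == 0 else "disabled"


if __name__ == "__main__":
    for index in range(3):
        print(index, wandb_mode_for(index))
```

So a log line from any worker other than process 0 will show mode= disabled, which is expected rather than a symptom.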

rjpower commented on August 15, 2024

> oh the wandb mode you logged is disabled (which is by design if you're not grabbing jax.process_index() == 0)

Ah, yeah I just grabbed a random worker.

> that's bizarre. maybe something is going on with wandb right now? but the timing is suspicious

Yeah, I can't figure it out. I've been tracing the flow through wandb_init.py. The display name seems to be generated by their server. If I don't assign a run id, the client will generate one, and then the server will return a nice display name. If I give my own unique name, it blows up. If I copy-and-paste the wandb name and change a character, it's okay.

Maybe it's checking that the name has some specific format (exactly 8 characters? anything else?)

dlwh commented on August 15, 2024

Oops, alphanumeric base-36. This is wandb's reference implementation in their SDK

import secrets
import string

def generate_id(length: int = 8) -> str:
    """Generate a random base-36 string of `length` digits."""
    # There are ~2.8T base-36 8-digit strings. If we generate 210k ids,
    # we'll have a ~1% chance of collision.
    alphabet = string.ascii_lowercase + string.digits
    return "".join(secrets.choice(alphabet) for _ in range(length))
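The numbers in that comment can be sanity-checked with the usual birthday approximation, p ≈ 1 - exp(-n(n-1) / (2N)), where N is the number of possible ids. This is a quick check written for this thread, not code from the wandb SDK.

```python
import math


def collision_probability(num_ids, alphabet_size=36, length=8):
    """Birthday-bound estimate that at least two of num_ids random ids collide."""
    space = alphabet_size ** length
    return 1.0 - math.exp(-num_ids * (num_ids - 1) / (2 * space))


if __name__ == "__main__":
    print(36 ** 8)                         # size of the id space, about 2.8e12
    print(collision_probability(210_000))  # a bit under 1%
```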

rjpower commented on August 15, 2024

How bizarre. Yeah, I changed the generation to make 8 character base-32 run ids and it works (but only if they're lowercase!)
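A small sketch of generating ids in the shape that seemed to work, plus a check for that shape. The rule encoded here (exactly 8 characters, lowercase letters and digits) is inferred from the experiments in this thread, not from any documented wandb requirement, and both function names are hypothetical.

```python
import re
import secrets
import string

# Inferred from this thread: 8 characters drawn from lowercase letters and digits.
ID_PATTERN = re.compile(r"[a-z0-9]{8}")
ALPHABET = string.ascii_lowercase + string.digits


def make_run_id(length=8):
    """Generate a random lowercase alphanumeric id."""
    return "".join(secrets.choice(ALPHABET) for _ in range(length))


def looks_like_wandb_id(run_id):
    """True if run_id has the shape that appeared to get a server-side name."""
    return ID_PATTERN.fullmatch(run_id) is not None


if __name__ == "__main__":
    for candidate in ["abcde", "3abcde123", "ABCD1234", make_run_id()]:
        print(candidate, looks_like_wandb_id(candidate))
```

Both ids tried earlier in the thread (abcde and 3abcde123) fail this check on length alone, which is consistent with them being used as the display name.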

dlwh commented on August 15, 2024

lol amazing

dlwh commented on August 15, 2024

i have more thoughts but these all seem fixed!
