
Comments (5)

rb-determined-ai avatar rb-determined-ai commented on June 12, 2024

oh, hey, this is my code!

The exit code of 80 with no logs means that the log shipper crashed for some reason.

There's a special escape hatch built in to help debug this case.

You can rerun this task with a bind mount with container_path: /ship_logs, and see what shows up in that directory. There should be a stack trace that explains what happened.
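
A minimal sketch of what that bind mount could look like in the experiment config, assuming the standard bind_mounts section. The host_path below is a hypothetical directory; any absolute path on the agent machine that you can read afterwards should do:

```yaml
# Added to the experiment config (e.g. const.yaml).
bind_mounts:
  # Hypothetical host directory; pick any absolute path you can read afterwards.
  - host_path: /tmp/ship_logs_debug
    # Container path named above; the log shipper's debug output should appear here.
    container_path: /ship_logs
```

After the task fails again, whatever landed in the host directory can be inspected directly on the agent machine.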

from determined.

samjenks avatar samjenks commented on June 12, 2024

Thanks, that got me more info. This seems to be the overall error. Any thoughts on why localhost is refusing the connection?
I switched the port from 8080 to 8090 during setup and updated the socket and env variables to match. The WebUI connects on that port, but is there something else I forgot to change to allow the connection?

ERROR: [1] root: Unable to reach the master at DET_MASTER=http://127.0.0.1:8090/. The connection to http://127.0.0.1:8090/ was refused.
Debug information:
master_url: http://127.0.0.1:8090/
endpoint: http://127.0.0.1:8090/api/v1/me
tls_verify_name: None
tls_noverify: False
tls_cert: None
http_proxy: None
HTTP_PROXY: None
no_proxy: None
NO_PROXY: None


rb-determined-ai avatar rb-determined-ai commented on June 12, 2024

Well, that depends. Do you have host networking mode configured? If not, then DET_MASTER of 127.0.0.1 will never work. You might have to configure the agent with container_master_host with an appropriate IP address that can point to the master.

But if things were working before and all you changed was the port, then probably you do have host networking mode configured, and it's a firewall problem.

Can you share more info about how you launched the cluster, and what your master and agent config files contain?


samjenks avatar samjenks commented on June 12, 2024

I switched the networking mode from bridge to host and it all worked for const.yaml. However, distrib.yaml hangs for a very long time during training without finishing. I've run it a couple of times to verify. It mostly hangs at the dataset download step. Is this normal, or is it another problem with my configuration? The log files don't have any information. Is there a way to see why it's hanging?

I launched the cluster via the deploy on prem guide here

Edit: The hanging seems to occur when two of the GPUs are in the same job. If they aren't in an experiment together, everything works; if they are, the code hangs in various places. I'm going to close this for the initial bug and will make a different post once I've run down some theories.


rb-determined-ai avatar rb-determined-ai commented on June 12, 2024

Hi sam, I'd say "don't waste your time". Instead focus on making sure you don't have 127.0.0.1 for your master address in the first place.

I'd recommend setting the container_master_host in the agent yaml to an IP address that resolves to your master node even from inside a container (often the external IP address of the machine running the master).
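
A rough sketch of the relevant agent config lines, assuming the master listens on port 8090 as described earlier in the thread. The address 192.0.2.10 is a placeholder, not a value from this cluster; substitute the IP that containers can actually route to:

```yaml
# agent.yaml (sketch; the address below is a placeholder)
master_host: 192.0.2.10
master_port: 8090
# Address that task containers use to reach the master,
# so DET_MASTER no longer points at 127.0.0.1.
container_master_host: 192.0.2.10
```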

