Comments (5)
oh, hey, this is my code!
The exit code of 80 with no logs means that the log shipper crashed for some reason.
There's a special escape hatch built in to help debug this case.
You can rerun this task with a bind mount with container_path: /ship_logs, and see what shows up in that directory. There should be a stack trace that explains what happened.
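For illustration, such a bind mount could be added to the experiment config roughly as follows. This is only a sketch: the host directory /tmp/ship_logs is an assumed placeholder (any existing, writable directory on the agent machine should do), while container_path is the value given above.

```yaml
# Experiment config fragment (sketch, not a complete config).
bind_mounts:
  # host_path is a hypothetical directory on the agent machine; pick your own.
  - host_path: /tmp/ship_logs
    # container_path must be /ship_logs so the log shipper's output lands here.
    container_path: /ship_logs
```

After rerunning the task, look in the host directory on the agent machine for whatever the log shipper wrote.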
from determined.
Thanks, that got me more info. This seems to be the overall error. Any thoughts on why localhost is refusing the connection?
I switched the port from 8080 to 8090 during setup and updated the socket and env variables as well. The WebUI connects on that port, but is there something else I forgot to change to allow the connection?
ERROR: [1] root: Unable to reach the master at DET_MASTER=http://127.0.0.1:8090/. The connection to http://127.0.0.1:8090/ was refused.
Debug information:
master_url: http://127.0.0.1:8090/
endpoint: http://127.0.0.1:8090/api/v1/me
tls_verify_name: None
tls_noverify: False
tls_cert: None
http_proxy: None
HTTP_PROXY: None
no_proxy: None
NO_PROXY: None
from determined.
Well, that depends. Do you have host networking mode configured? If not, then a DET_MASTER of 127.0.0.1 will never work. You might have to configure the agent with container_master_host set to an appropriate IP address that can point to the master.
But if things were working before and all you changed was the port, then probably you do have host networking mode configured, and it's a firewall problem.
Can you share more info about how you launched the cluster, and what your master and agent config files contain?
from determined.
I switched the networking mode from bridge to host and it all worked for const.yaml. However, distrib.yaml hangs for a very long time during training without finishing. I've run it a couple of times to verify. It mostly hangs while downloading the dataset. Is this normal, or is it another problem with my configuration? The log files don't have any information. Is there a way to see why it's hanging?
I launched the cluster via the deploy on prem guide here
Edit: The hanging seems to occur when two of the GPUs are in the same job. If they aren't in an experiment together, everything works; if they are, the code hangs in various places. I'm going to close this for the initial bug and will make a different post once I've run down some theories.
from determined.
Hi sam, I'd say "don't waste your time". Instead focus on making sure you don't have 127.0.0.1 for your master address in the first place.
I'd recommend setting the container_master_host in the agent yaml to an IP address that resolves to your master node even from inside a container (often the external IP address of the machine running the master).
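As a rough sketch of that suggestion, the agent yaml might look something like the following. The address 192.0.2.10 is a placeholder from the documentation IP range, not a real value. Port 8090 is the one mentioned earlier in this thread. The rest of the agent config is omitted.

```yaml
# Agent config fragment (sketch under the assumptions stated above).
# Address the agent itself uses to reach the master (placeholder).
master_host: 192.0.2.10
master_port: 8090
# Address handed to task containers as the master address; it must be
# reachable from inside a container, so it should not be 127.0.0.1.
container_master_host: 192.0.2.10
```

Restart the agent after changing the file so that newly launched containers pick up the new master address.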
from determined.