Comments (13)
Thanks @rb-determined-ai for your illuminating comment! It finally worked out!
I'm posting here the additional steps that solved the problem:
- Find the ID of the network used by the cluster using
docker network ls
and note the <ID>.
- Run
ifconfig
on the cluster machine and verify that there is an interface called br-<ID>
- Run
iptables -A INPUT -i br-<ID> -j ACCEPT
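The steps above can be sketched as a small Python helper. This is only an illustration: the function names are hypothetical, the network name determined_default is taken from the docker network ls output in this thread, and it assumes Docker's default naming for user-defined bridge networks (br- followed by the 12-character short network ID), which matches the br-9f0ff59e2639 interface seen in the ip a output.

```python
from typing import Optional


def find_network_id(network_ls_output: str, network_name: str) -> Optional[str]:
    """Return the short ID of `network_name` from `docker network ls` output."""
    for line in network_ls_output.splitlines()[1:]:  # skip the header row
        fields = line.split()
        # Columns: NETWORK ID, NAME, DRIVER, SCOPE
        if len(fields) >= 4 and fields[1] == network_name:
            return fields[0]
    return None


def accept_rule(network_id: str) -> str:
    """Build the iptables command that accepts traffic from the network's bridge."""
    # Assumption: default bridge naming, br-<first 12 characters of the ID>.
    return f"iptables -A INPUT -i br-{network_id[:12]} -j ACCEPT"


# Output copied from the `docker network ls` listing in this thread.
SAMPLE = """NETWORK ID NAME DRIVER SCOPE
6289933d84f3 bridge bridge local
9f0ff59e2639 determined_default bridge local
ebc032543c8a host host local
e8909223a859 none null local"""

network_id = find_network_id(SAMPLE, "determined_default")
if network_id is not None:
    print(accept_rule(network_id))
```

The printed command still has to be run as root on the cluster machine, and a rule added with iptables -A is not persisted across reboots unless it is saved separately.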
Let me know if I can help add this information to the FAQ docs or directly to the installation docs.
Thanks for your precious help.
from determined.
Hi @PieroMacaluso can you give us a few more details about the machine on which you've spawned the cluster?
What OS are you running? Mac/Linux?
What's the output of docker network ls
, and ip a
& ip route
if on Linux, or netstat -rn
if on Mac?
from determined.
Hi @vishnu2kmohan and thanks for your reply!
I am running this on: Ubuntu 20.04 LTS
Here are the outputs I get
docker network ls
NETWORK ID NAME DRIVER SCOPE
6289933d84f3 bridge bridge local
9f0ff59e2639 determined_default bridge local
ebc032543c8a host host local
e8909223a859 none null local
ip a
(privacy filter applied)
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
inet 127.0.0.1/8 scope host lo
valid_lft forever preferred_lft forever
inet6 ::1/128 scope host
valid_lft forever preferred_lft forever
2: enp2s0: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN group default qlen 1000
link/ether d4:5d:64:07:a4:41 brd ff:ff:ff:ff:ff:ff
3: eno1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UP group default qlen 1000
link/ether <privacy-filter> brd ff:ff:ff:ff:ff:ff
inet <privacy-filter> brd <privacy-filter> scope global eno1
valid_lft forever preferred_lft forever
inet6 <privacy-filter> scope link
valid_lft forever preferred_lft forever
5: docker0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default
link/ether 02:42:52:f6:cc:7c brd ff:ff:ff:ff:ff:ff
inet 172.17.0.1/16 brd 172.17.255.255 scope global docker0
valid_lft forever preferred_lft forever
inet6 fe80::42:52ff:fef6:cc7c/64 scope link
valid_lft forever preferred_lft forever
555: vethc4b3d3f@if554: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue master docker0 state UP group default
link/ether 2a:0a:05:0d:5e:bb brd ff:ff:ff:ff:ff:ff link-netnsid 3
inet6 fe80::280a:5ff:fe0d:5ebb/64 scope link
valid_lft forever preferred_lft forever
556: br-9f0ff59e2639: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default
link/ether 02:42:f1:f6:02:2b brd ff:ff:ff:ff:ff:ff
inet 192.168.64.1/20 brd 192.168.79.255 scope global br-9f0ff59e2639
valid_lft forever preferred_lft forever
inet6 fe80::42:f1ff:fef6:22b/64 scope link
valid_lft forever preferred_lft forever
558: vethbb830b8@if557: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue master br-9f0ff59e2639 state UP group default
link/ether 4e:9e:f7:4b:ff:0d brd ff:ff:ff:ff:ff:ff link-netnsid 0
inet6 fe80::4c9e:f7ff:fe4b:ff0d/64 scope link
valid_lft forever preferred_lft forever
560: veth497c3f0@if559: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue master br-9f0ff59e2639 state UP group default
link/ether 0a:49:fc:ec:65:a5 brd ff:ff:ff:ff:ff:ff link-netnsid 1
inet6 fe80::849:fcff:feec:65a5/64 scope link
valid_lft forever preferred_lft forever
336: veth1bec14a@if335: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue master docker0 state UP group default
link/ether be:cd:46:16:23:d1 brd ff:ff:ff:ff:ff:ff link-netnsid 2
inet6 fe80::bccd:46ff:fe16:23d1/64 scope link
valid_lft forever preferred_lft forever
ip route
(privacy filter applied)
default via <privacy-filter> dev eno1 proto static
<privacy-filter> dev eno1 proto kernel scope link src <privacy-filter>
172.17.0.0/16 dev docker0 proto kernel scope link src 172.17.0.1
192.168.64.0/20 dev br-9f0ff59e2639 proto kernel scope link src 192.168.64.1
from determined.
when I try to open the notebook from the browser
@PieroMacaluso how are you trying to open the Notebook from the browser?
You should be able to load up the Determined WebUI by pointing to http://localhost:8080
(or to http://<Hostname or IP of your Ubuntu machine>:8080
if you're not using a browser launched on the Ubuntu machine itself). On the Dashboard, which is the default page that loads, you should see an active task card for the Notebook; clicking it should open the Notebook.
Alternatively, you can navigate to the Notebooks section on the left sidebar of the WebUI, and click the Open
button next to the entry to launch it from there.
Finally, you should also be able to access it directly by visiting https://<Hostname/IP of your Ubuntu machine>:8080/proxy/c91200a8-bae3-4c61-b3e9-07af6d2fc51e/lab/tree/Notebook.ipynb
(assuming the UUID of the Notebook that was spawned is still c91200a8-bae3-4c61-b3e9-07af6d2fc51e
)
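As a rough illustration of the direct URL described above, the following sketch assembles it from its parts. The function name is hypothetical, and the http default is an assumption based on the http://localhost:8080 address mentioned earlier in this comment; pass a different scheme if the master is served over TLS.

```python
import uuid


def notebook_proxy_url(master_host: str, notebook_id: str,
                       port: int = 8080, scheme: str = "http") -> str:
    """Build the direct proxy URL for a notebook served through the master."""
    # Raises ValueError if notebook_id is not a well-formed UUID.
    uuid.UUID(notebook_id)
    return f"{scheme}://{master_host}:{port}/proxy/{notebook_id}/lab/tree/Notebook.ipynb"


print(notebook_proxy_url("localhost", "c91200a8-bae3-4c61-b3e9-07af6d2fc51e"))
```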
from determined.
I tried to open the notebooks by using the approaches you described.
If I open the notebook with the Open button, the system stays on this page for a few seconds.
Then it redirects to https://<Hostname/IP of your Ubuntu machine>:8080/proxy/c91200a8-bae3-4c61-b3e9-07af6d2fc51e/lab/tree/Notebook.ipynb?reset
, but it returns ERROR 502
.
To force this behaviour I tried to use the last approach https://<Hostname/IP of your Ubuntu machine>:8080/proxy/c91200a8-bae3-4c61-b3e9-07af6d2fc51e/lab/tree/Notebook.ipynb
but I still receive ERROR 502
.
from determined.
@PieroMacaluso, thanks for reporting this, this definitely should work.
Would you mind sharing a portion of your master logs that covers the window starting with when the notebook started and includes the moment when you hit the 502 error? You can find them from the Web UI in the lower left-hand corner.
Also, it would be helpful if we could see how the agent was launched. Could you run docker ps
to get the container ID of the container running the determined-agent
image, and then send the output of docker exec <the agent container id> env
.
Also, if you could share agent logs, found via docker container logs <the agent container id>
, that would be excellent.
Finally, if you could share more about the shape of your <Hostname/IP of your Ubuntu machine>
, that would help. We have hit bugs in the past where we could handle an address like xxx.xxx.xxx.xxx
but where we didn't properly handle [::xxxx:xxxx]
or something, and it would be good to confirm this isn't another instance of something like that.
from determined.
Thanks @rb-determined-ai for your time.
Would you mind sharing a portion of your master logs that covers the window starting with when the notebook started and includes the moment when you hit the 502 error? You can find them from the Web UI in the lower left-hand corner.
2020-07-23, 17:17:00 info creating notebook id="notebooks" system="master" type="notebookManager"
2020-07-23, 17:17:00 info created notebook e2846689-1d3f-4a15-9310-af086b7aa6f8 id="notebooks" system="master" type="notebookManager"
2020-07-23, 17:17:01 info registering service: e2846689-1d3f-4a15-9310-af086b7aa6f8 (http://130.192.93.60:32817) id="proxy" system="master" type="Proxy"
2020-07-23, 17:17:04 info readiness check passed: notebook id="e2846689-1d3f-4a15-9310-af086b7aa6f8" system="master" type="command"
#### Page hits HTTP Error 502 #### <- this is not part of the log
2020-07-23, 17:17:37 error error while actor was running error="websocket: close 1001 (going away)" id="7bfea5eb-37d0-4ad7-bef6-a75559cecd45" system="master" type="websocketActor"
2020-07-23, 17:17:37 error websocket: close 1001 (going away)
2020-07-23, 17:17:37 error http: connection has been hijacked
Also, it would be helpful if we could see how the agent was launched. Could you run docker ps to get the container ID of the container running the
determined-agent
image, and then send the output of docker exec <the agent container id> env
.
PATH=/opt/conda/bin:/usr/local/nvidia/bin:/usr/local/cuda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
HOSTNAME=a183c3940ec8
PYTHONUSERBASE=/run/determined/pythonuserbase
DET_TASK_ID=e2846689-1d3f-4a15-9310-af086b7aa6f8
DET_CLUSTER_ID=8016b3be-f118-412b-88eb-3c68100ae9d2
DET_MASTER_ID=6379c0b9-1a50-4b81-85eb-540bbfcf4e40
DET_MASTER=<privacy-filter>:8080
DET_MASTER_HOST=<privacy-filter>
DET_MASTER_ADDR=<privacy-filter>
DET_MASTER_PORT=8080
DET_AGENT_ID=determined-agent-0
DET_CONTAINER_ID=a9666ffe-7034-4b1c-a4ad-65e6cad5fa25
DET_SLOT_IDS=[3]
DET_USE_GPU=true
CUDA_VERSION=10.0.130
CUDA_PKG_VERSION=10-0=10.0.130-1
LD_LIBRARY_PATH=/usr/local/nvidia/lib:/usr/local/nvidia/lib64
NVIDIA_VISIBLE_DEVICES=all
NVIDIA_DRIVER_CAPABILITIES=compute,utility
NVIDIA_REQUIRE_CUDA=cuda>=10.0 brand=tesla,driver>=384,driver<385 brand=tesla,driver>=410,driver<411
NCCL_VERSION=2.4.8
LIBRARY_PATH=/usr/local/cuda/lib64/stubs
CUDNN_VERSION=7.6.5.32
LANG=C.UTF-8
LC_ALL=C.UTF-8
CONDA_DIR=/opt/conda
PYTHONUNBUFFERED=1
PYTHONFAULTHANDLER=1
PYTHONHASHSEED=0
JUPYTER_CONFIG_DIR=/run/determined/jupyter/config
JUPYTER_DATA_DIR=/run/determined/jupyter/data
JUPYTER_RUNTIME_DIR=/run/determined/jupyter/runtime
HOME=/root
Also, if you could share agent logs, found via docker container logs <the agent container id>, that would be excellent.
Processing /opt/determined/wheels/determined-0.12.11-py3-none-any.whl
Processing /opt/determined/wheels/determined_cli-0.12.11-py3-none-any.whl
Processing /opt/determined/wheels/determined_common-0.12.11-py3-none-any.whl
Requirement already satisfied: psutil in /opt/conda/lib/python3.6/site-packages (from determined==0.12.11) (5.7.0)
Requirement already satisfied: cloudpickle==0.5.3 in /opt/conda/lib/python3.6/site-packages (from determined==0.12.11) (0.5.3)
Requirement already satisfied: lomond==0.3.3 in /opt/conda/lib/python3.6/site-packages (from determined==0.12.11) (0.3.3)
Requirement already satisfied: packaging==19.0 in /opt/conda/lib/python3.6/site-packages (from determined==0.12.11) (19.0)
Requirement already satisfied: numpy>=1.16.2 in /opt/conda/lib/python3.6/site-packages (from determined==0.12.11) (1.18.5)
Requirement already satisfied: simplejson==3.16.0 in /opt/conda/lib/python3.6/site-packages (from determined==0.12.11) (3.16.0)
Requirement already satisfied: matplotlib in /opt/conda/lib/python3.6/site-packages (from determined==0.12.11) (3.2.1)
Requirement already satisfied: requests>=2.20.0 in /opt/conda/lib/python3.6/site-packages (from determined==0.12.11) (2.23.0)
Requirement already satisfied: dill>=0.2.9 in /opt/conda/lib/python3.6/site-packages (from determined==0.12.11) (0.3.1.1)
Requirement already satisfied: boto3>=1.9.220 in /opt/conda/lib/python3.6/site-packages (from determined==0.12.11) (1.14.0)
Collecting yogadl==0.1.1
Using cached yogadl-0.1.1-py3-none-any.whl (32 kB)
Requirement already satisfied: h5py>=2.9.0 in /opt/conda/lib/python3.6/site-packages (from determined==0.12.11) (2.10.0)
Requirement already satisfied: pyzmq==18.1.0 in /opt/conda/lib/python3.6/site-packages (from determined==0.12.11) (18.1.0)
Requirement already satisfied: gitpython==2.1.11 in /opt/conda/lib/python3.6/site-packages (from determined-cli==0.12.11) (2.1.11)
Requirement already satisfied: python-dateutil==2.8.0 in /opt/conda/lib/python3.6/site-packages (from determined-cli==0.12.11) (2.8.0)
Requirement already satisfied: ruamel.yaml>=0.15.78 in /opt/conda/lib/python3.6/site-packages (from determined-cli==0.12.11) (0.16.10)
Requirement already satisfied: tabulate>=0.8.3 in /opt/conda/lib/python3.6/site-packages (from determined-cli==0.12.11) (0.8.7)
Requirement already satisfied: termcolor==1.1.0 in /opt/conda/lib/python3.6/site-packages (from determined-cli==0.12.11) (1.1.0)
Requirement already satisfied: argcomplete==1.9.4 in /opt/conda/lib/python3.6/site-packages (from determined-cli==0.12.11) (1.9.4)
Requirement already satisfied: hdfs>=2.2.2 in /opt/conda/lib/python3.6/site-packages (from determined-common==0.12.11) (2.5.8)
Requirement already satisfied: pathspec>=0.6.0 in /opt/conda/lib/python3.6/site-packages (from determined-common==0.12.11) (0.8.0)
Requirement already satisfied: google-cloud-storage>=1.20.0 in /opt/conda/lib/python3.6/site-packages (from determined-common==0.12.11) (1.28.1)
Requirement already satisfied: six>=1.10.0 in /opt/conda/lib/python3.6/site-packages (from lomond==0.3.3->determined==0.12.11) (1.15.0)
Requirement already satisfied: pyparsing>=2.0.2 in /opt/conda/lib/python3.6/site-packages (from packaging==19.0->determined==0.12.11) (2.4.7)
Requirement already satisfied: cycler>=0.10 in /opt/conda/lib/python3.6/site-packages (from matplotlib->determined==0.12.11) (0.10.0)
Requirement already satisfied: kiwisolver>=1.0.1 in /opt/conda/lib/python3.6/site-packages (from matplotlib->determined==0.12.11) (1.2.0)
Requirement already satisfied: certifi>=2017.4.17 in /opt/conda/lib/python3.6/site-packages (from requests>=2.20.0->determined==0.12.11) (2020.4.5.1)
Requirement already satisfied: idna<3,>=2.5 in /opt/conda/lib/python3.6/site-packages (from requests>=2.20.0->determined==0.12.11) (2.9)
Requirement already satisfied: urllib3!=1.25.0,!=1.25.1,<1.26,>=1.21.1 in /opt/conda/lib/python3.6/site-packages (from requests>=2.20.0->determined==0.12.11) (1.25.8)
Requirement already satisfied: chardet<4,>=3.0.2 in /opt/conda/lib/python3.6/site-packages (from requests>=2.20.0->determined==0.12.11) (3.0.4)
Requirement already satisfied: s3transfer<0.4.0,>=0.3.0 in /opt/conda/lib/python3.6/site-packages (from boto3>=1.9.220->determined==0.12.11) (0.3.3)
Requirement already satisfied: jmespath<1.0.0,>=0.7.1 in /opt/conda/lib/python3.6/site-packages (from boto3>=1.9.220->determined==0.12.11) (0.10.0)
Requirement already satisfied: botocore<1.18.0,>=1.17.0 in /opt/conda/lib/python3.6/site-packages (from boto3>=1.9.220->determined==0.12.11) (1.17.0)
Requirement already satisfied: websockets in /opt/conda/lib/python3.6/site-packages (from yogadl==0.1.1->determined==0.12.11) (8.1)
Requirement already satisfied: async-generator in /opt/conda/lib/python3.6/site-packages (from yogadl==0.1.1->determined==0.12.11) (1.10)
Requirement already satisfied: lmdb in /opt/conda/lib/python3.6/site-packages (from yogadl==0.1.1->determined==0.12.11) (0.98)
Requirement already satisfied: filelock in /opt/conda/lib/python3.6/site-packages (from yogadl==0.1.1->determined==0.12.11) (3.0.12)
Requirement already satisfied: gitdb2>=2.0.0 in /opt/conda/lib/python3.6/site-packages (from gitpython==2.1.11->determined-cli==0.12.11) (4.0.2)
Requirement already satisfied: ruamel.yaml.clib>=0.1.2; platform_python_implementation == "CPython" and python_version < "3.9" in /opt/conda/lib/python3.6/site-packages (from ruamel.yaml>=0.15.78->determined-cli==0.12.11) (0.2.0)
Requirement already satisfied: docopt in /opt/conda/lib/python3.6/site-packages (from hdfs>=2.2.2->determined-common==0.12.11) (0.6.2)
Requirement already satisfied: google-cloud-core<2.0dev,>=1.2.0 in /opt/conda/lib/python3.6/site-packages (from google-cloud-storage>=1.20.0->determined-common==0.12.11) (1.3.0)
Requirement already satisfied: google-resumable-media<0.6dev,>=0.5.0 in /opt/conda/lib/python3.6/site-packages (from google-cloud-storage>=1.20.0->determined-common==0.12.11) (0.5.1)
Requirement already satisfied: google-auth<2.0dev,>=1.11.0 in /opt/conda/lib/python3.6/site-packages (from google-cloud-storage>=1.20.0->determined-common==0.12.11) (1.16.1)
Requirement already satisfied: docutils<0.16,>=0.10 in /opt/conda/lib/python3.6/site-packages (from botocore<1.18.0,>=1.17.0->boto3>=1.9.220->determined==0.12.11) (0.15.2)
Requirement already satisfied: gitdb>=4.0.1 in /opt/conda/lib/python3.6/site-packages (from gitdb2>=2.0.0->gitpython==2.1.11->determined-cli==0.12.11) (4.0.5)
Requirement already satisfied: google-api-core<2.0.0dev,>=1.16.0 in /opt/conda/lib/python3.6/site-packages (from google-cloud-core<2.0dev,>=1.2.0->google-cloud-storage>=1.20.0->determined-common==0.12.11) (1.20.0)
Requirement already satisfied: pyasn1-modules>=0.2.1 in /opt/conda/lib/python3.6/site-packages (from google-auth<2.0dev,>=1.11.0->google-cloud-storage>=1.20.0->determined-common==0.12.11) (0.2.8)
Requirement already satisfied: cachetools<5.0,>=2.0.0 in /opt/conda/lib/python3.6/site-packages (from google-auth<2.0dev,>=1.11.0->google-cloud-storage>=1.20.0->determined-common==0.12.11) (4.1.0)
Requirement already satisfied: rsa<4.1,>=3.1.4 in /opt/conda/lib/python3.6/site-packages (from google-auth<2.0dev,>=1.11.0->google-cloud-storage>=1.20.0->determined-common==0.12.11) (4.0)
Requirement already satisfied: setuptools>=40.3.0 in /opt/conda/lib/python3.6/site-packages (from google-auth<2.0dev,>=1.11.0->google-cloud-storage>=1.20.0->determined-common==0.12.11) (47.1.1.post20200604)
Requirement already satisfied: smmap<4,>=3.0.1 in /opt/conda/lib/python3.6/site-packages (from gitdb>=4.0.1->gitdb2>=2.0.0->gitpython==2.1.11->determined-cli==0.12.11) (3.0.4)
Requirement already satisfied: googleapis-common-protos<2.0dev,>=1.6.0 in /opt/conda/lib/python3.6/site-packages (from google-api-core<2.0.0dev,>=1.16.0->google-cloud-core<2.0dev,>=1.2.0->google-cloud-storage>=1.20.0->determined-common==0.12.11) (1.52.0)
Requirement already satisfied: pytz in /opt/conda/lib/python3.6/site-packages (from google-api-core<2.0.0dev,>=1.16.0->google-cloud-core<2.0dev,>=1.2.0->google-cloud-storage>=1.20.0->determined-common==0.12.11) (2020.1)
Requirement already satisfied: protobuf>=3.4.0 in /opt/conda/lib/python3.6/site-packages (from google-api-core<2.0.0dev,>=1.16.0->google-cloud-core<2.0dev,>=1.2.0->google-cloud-storage>=1.20.0->determined-common==0.12.11) (3.12.2)
Requirement already satisfied: pyasn1<0.5.0,>=0.4.6 in /opt/conda/lib/python3.6/site-packages (from pyasn1-modules>=0.2.1->google-auth<2.0dev,>=1.11.0->google-cloud-storage>=1.20.0->determined-common==0.12.11) (0.4.8)
Installing collected packages: determined-common, yogadl, determined, determined-cli
Successfully installed determined-0.12.11 determined-cli-0.12.11 determined-common-0.12.11 yogadl-0.1.1
[I 15:17:04.131 LabApp] Writing notebook server cookie secret to /run/determined/jupyter/runtime/notebook_cookie_secret
[W 15:17:04.446 LabApp] All authentication is disabled. Anyone who can connect to this server will be able to run code.
[I 15:17:04.453 LabApp] JupyterLab extension loaded from /opt/conda/lib/python3.6/site-packages/jupyterlab
[I 15:17:04.453 LabApp] JupyterLab application directory is /opt/conda/share/jupyter/lab
[I 15:17:04.455 LabApp] Serving notebooks from local directory: /run/determined/workdir
[I 15:17:04.455 LabApp] The Jupyter Notebook is running at:
[I 15:17:04.455 LabApp] http://a183c3940ec8:8888/proxy/e2846689-1d3f-4a15-9310-af086b7aa6f8/
[I 15:17:04.455 LabApp] Use Control-C to stop this server and shut down all kernels (twice to skip confirmation).
Finally, if you could share more about the shape of your
<Hostname/IP of your Ubuntu machine>
, that would help. We have hit bugs in the past where we could handle an address like xxx.xxx.xxx.xxx
but where we didn't properly handle [::xxxx:xxxx]
or something, and it would be good to confirm this isn't another instance of something like that.
The IP shape of my Ubuntu machine is XXX.XXX.XX.XX
It seems to me that the link http://XXX.XXX.XX.XX/proxy/e2846689-1d3f-4a15-9310-af086b7aa6f8/
does not properly redirect the connection to http://XXX.XXX.XX.XX:32817/proxy/e2846689-1d3f-4a15-9310-af086b7aa6f8/
. If I put the last-mentioned link in the browser, it works! However, I noticed that the connection problem also occurs when using the Native API.
from determined.
I spent some time today trying to reproduce this on a fresh Ubuntu 20.04 machine, but no luck.
OK, since you are able to connect to the notebook by visiting http://XXX.XXX.XX.XX:32817/proxy/e2846689-1d3f-4a15-9310-af086b7aa6f8
we know that the notebook is running and the port is accessible.
We also know, based on the registering service: ...
line from the master logs, that the master is going to proxy connections for the UUID on that line to the address inside the parentheses on that line, which looks like it should absolutely be the same http://XXX.XXX.XX.XX:32817
that we just manually visited. But somehow that is failing, which is why we get a 502 error (although it is odd that there's no error message in the master logs).
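To make the reading of that log line concrete, here is a minimal sketch that pulls the UUID and the proxy target out of a registering service line. The regular expression and function name are illustrative assumptions based only on the log format quoted in this thread, and the sample line uses the XXX.XXX.XX.XX placeholder instead of a real address.

```python
import re
from typing import Optional, Tuple

# Matches: registering service: <task UUID> (<target URL>)
REGISTER_RE = re.compile(
    r"registering service: (?P<uuid>[0-9a-f-]{36}) \((?P<target>[^)]+)\)"
)


def parse_registration(line: str) -> Optional[Tuple[str, str]]:
    """Return (task UUID, proxy target) from a master log line, or None."""
    match = REGISTER_RE.search(line)
    if match is None:
        return None
    return match.group("uuid"), match.group("target")


LINE = (
    '2020-07-23, 17:17:01 info registering service: '
    'e2846689-1d3f-4a15-9310-af086b7aa6f8 (http://XXX.XXX.XX.XX:32817) '
    'id="proxy" system="master" type="Proxy"'
)
print(parse_registration(LINE))
```

The second element of the result is the address the master will try to reach, which is the one worth testing from inside the master container.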
The master is inside a container, so I would normally suspect name resolution or routing issues. However, if every time that you inserted <privacy-filter>
it was the same public XXX.XXX.XX.XX
ip address, then neither resolution nor routing could be a problem. (sidenote: were all instances of <privacy-filter>
the same public ip address?)
Then the only thing I can think of that is causing the problem would be if there was some firewall rule that was disallowing docker containers from talking to open ports on the host or something like that. That is the only explanation I can come up with for why you would be able to connect to port 32817
externally but the master would fail to connect to the same port.
That's easy enough to check... Could you identify the docker container for the determinedai/determined-master
image, then do something like this:
docker exec CONTAINER_ID apt update
docker exec CONTAINER_ID apt install curl -y
docker exec CONTAINER_ID curl -D - http://XXX.XXX.XX.XX:32817/proxy/e2846689-1d3f-4a15-9310-af086b7aa6f8/lab/tree/Notebook.ipynb
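If installing curl in the container is not an option, a similar reachability check can be sketched in Python, assuming a Python 3 interpreter is available wherever the check is run (whether the master image ships one is not confirmed in this thread). The function name is hypothetical. It only tests the TCP connection, not the HTTP response, but it separates a refused connection from a timeout, and a firewall that silently drops packets typically shows up as a timeout.

```python
import socket
import sys


def probe(host: str, port: int, timeout: float = 5.0) -> str:
    """Try a TCP connection and report "open", "refused", "timeout", or an error."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "open"
    except ConnectionRefusedError:
        return "refused"
    except socket.timeout:
        return "timeout"
    except OSError as exc:
        return f"error: {exc}"


if __name__ == "__main__" and len(sys.argv) == 3:
    # Usage (hypothetical file name): python probe.py <agent IP> <port>
    print(probe(sys.argv[1], int(sys.argv[2])))
```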
Also, just for good measure, would you mind sending the agent environment and the agent container logs? In your last response you sent the environment and the container logs for the notebook container (which was also helpful). The agent docker container should be the one running the determinedai/determined-agent:0.12.11
image.
Thank you for your patience with all of this.
from determined.
Thanks for your patience and kindness @rb-determined-ai! Here is what I have found.
The master is inside a container, so I would normally suspect name resolution or routing issues. However, if every time that you inserted <privacy-filter> it was the same public XXX.XXX.XX.XX ip address, then neither resolution nor routing could be a problem. (sidenote: were all instances of <privacy-filter> the same public ip address?)
I always used the same public IP address.
docker exec CONTAINER_ID apt update
docker exec CONTAINER_ID apt install curl -y
docker exec CONTAINER_ID curl -D - http://XXX.XXX.XX.XX:32817/proxy/e2846689-1d3f-4a15-9310-af086b7aa6f8/lab/tree/Notebook.ipynb
I executed these commands on the machine (substituting the correct values) and the curl command did not receive an answer. I left it running for 2 minutes.
Also, just for good measure, would you mind sending the agent environment and the agent container logs?
Agent Environment: determinedai/determined-agent:0.12.11
Output of Agent Logs from the start of a new notebook:
INFO[2020-07-27T10:09:41Z] transitioning state from ASSIGNED to PULLING id=17673215-55e8-495a-9f65-439433bd3afc system=determined-agent-0 type=containerActor
WARN[2020-07-27T10:09:41Z] can't retrieve any credential stores: can't open docker config: open /root/.docker/config.json: no such file or directory id=docker system=determined-agent-0 type=dockerActor
INFO[2020-07-27T10:09:41Z] transitioning state from PULLING to STARTING id=17673215-55e8-495a-9f65-439433bd3afc system=determined-agent-0 type=containerActor
INFO[2020-07-27T10:09:42Z] transitioning state from STARTING to RUNNING id=17673215-55e8-495a-9f65-439433bd3afc system=determined-agent-0 type=containerActor
(The IDs in this part are different because I had to restart the cluster)
from determined.
curl is working with http://488f9e1cafe7:8888/proxy/bf8271a4-04b3-4c3c-bc3e-ba3cbe01c9d5/lab/tree/Notebook.ipynb
from determined.
Ok, cool. This narrows the issue down quite a bit. The fact that this works
curl http://488f9e1cafe7:8888/proxy/bf8271a4-04b3-4c3c-bc3e-ba3cbe01c9d5/lab/tree/Notebook.ipynb
but this doesn't
curl -D - http://XXX.XXX.XX.XX:32817/proxy/e2846689-1d3f-4a15-9310-af086b7aa6f8/lab/tree/Notebook.ipynb
confirms that the issue is a firewall issue. Because our system is designed to operate on a multi-machine cluster, we would not try to use docker to deal with bridged networking; i.e. we would never try to contact the notebook by trying to resolve the 488f9e1cafe7:8888
hostname. Instead, we configure the agent to expose a port from the container which it launches, and the master will reach out to the agent's IP address, at that port, which will get forwarded to the task container. It would look like this:
incoming to    proxied to
 master IP      agent IP
----------+   +----------+
          |   |          |
       ___v___|___    ___v______________
      |   |   |   |  | | |            | |
      | | proxy | |  | | port forward | |
      | |_______| |  | |_|____________| |
      |           |  | | v            | |
      |  master   |  | | notebook     | |
      |___________|  | |______________| |
                     |                  |
                     |      agent       |
                     |__________________|
The whole reason for this proxying layer is that we know that the master is going to be accessible externally, but there is no guarantee that the agent is externally accessible; only that the agent is accessible from the master.
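The proxying idea described above can be modelled in a few lines of Python. This is a toy sketch, not the master's actual code: the class and method names are hypothetical, and the only behaviour taken from this thread is that the full /proxy/<UUID>/... path is kept when the request is forwarded to the agent's IP and forwarded port.

```python
from typing import Dict


class ProxyTable:
    """Toy model of the master's proxy: maps a task UUID to an agent address."""

    def __init__(self) -> None:
        self._services: Dict[str, str] = {}

    def register(self, task_id: str, agent_url: str) -> None:
        # agent_url is the agent IP plus the port it forwards to the container.
        self._services[task_id] = agent_url.rstrip("/")

    def resolve(self, path: str) -> str:
        # Incoming paths look like /proxy/<task UUID>/<rest of the path>.
        parts = path.split("/", 3)
        if len(parts) < 3 or parts[1] != "proxy" or parts[2] not in self._services:
            raise KeyError(f"no service registered for {path!r}")
        # The path is forwarded unchanged to the registered agent address.
        return self._services[parts[2]] + path


table = ProxyTable()
table.register("e2846689-1d3f-4a15-9310-af086b7aa6f8", "http://XXX.XXX.XX.XX:32817")
print(table.resolve("/proxy/e2846689-1d3f-4a15-9310-af086b7aa6f8/lab/tree/Notebook.ipynb"))
```

In this model the 502 corresponds to the case where the lookup succeeds but the master cannot actually open a connection to the resolved address.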
So this should shed some light on why det-deploy local
configures the agent and master via the external IP address of your machine; it allows both the agent and the master to think of each other as being on different machines in a way that is transparent to our system.
Now, we know that because this does not work
curl -D - http://XXX.XXX.XX.XX:32817/proxy/e2846689-1d3f-4a15-9310-af086b7aa6f8/lab/tree/Notebook.ipynb
that you either have a firewall rule blocking incoming connections to your machine from docker, or that you have a DROP or REJECT policy and you have not explicitly allowed incoming connections from docker.
If it is the second case, you might be able to enable the relevant connections via:
sudo iptables -A INPUT -i docker0 -j ACCEPT
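To check which of the two cases applies, the output of sudo iptables -S INPUT can be inspected. The sketch below parses that output; the function names and the sample rules are made up for illustration (they are not taken from the reporter's machine), and the interface check only recognises a bare "-i <interface> -j ACCEPT" rule with no other conditions.

```python
from typing import Optional


def input_policy(rules: str) -> Optional[str]:
    """Return the default policy of the INPUT chain from `iptables -S INPUT` output."""
    for line in rules.splitlines():
        tokens = line.split()
        if len(tokens) == 3 and tokens[:2] == ["-P", "INPUT"]:
            return tokens[2]
    return None


def accepts_interface(rules: str, iface: str) -> bool:
    """Return True if a bare ACCEPT rule exists for traffic arriving on `iface`."""
    for line in rules.splitlines():
        tokens = line.split()
        if tokens[:2] == ["-A", "INPUT"] and tokens[2:] == ["-i", iface, "-j", "ACCEPT"]:
            return True
    return False


# Hypothetical sample output, for illustration only.
SAMPLE_RULES = """-P INPUT DROP
-A INPUT -i lo -j ACCEPT
-A INPUT -i docker0 -j ACCEPT"""

print(input_policy(SAMPLE_RULES))
print(accepts_interface(SAMPLE_RULES, "docker0"))
print(accepts_interface(SAMPLE_RULES, "br-9f0ff59e2639"))
```

With rules like the sample, traffic from containers on a user-defined bridge such as br-9f0ff59e2639 would still fall through to the DROP policy even though docker0 is allowed.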
from determined.
@PieroMacaluso Sorry for the delayed response, but if you wanted to contribute this to the FAQ Docs, that would be fantastic! We had another community member on our community slack hit the same issue actually.
from determined.
I will soon contribute to the FAQ Docs! Thanks for your time and help!
from determined.