Comments (13)

PieroMacaluso avatar PieroMacaluso commented on May 22, 2024 1

Thanks @rb-determined-ai for your illuminating comment! It finally worked out!

I am posting here the additional steps that solved the problem:

  1. Find the ID of the network used by the cluster with docker network ls and note it down as <ID>.
  2. Run ifconfig on the cluster machine and verify that there is an interface called br-<ID>.
  3. Run iptables -A INPUT -i br-<ID> -j ACCEPT
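
The steps above can be sketched as a couple of small shell helpers. This is only an illustration: bridge_iface and accept_rule are hypothetical names, and the sketch assumes the default Docker behaviour of naming the bridge br- followed by the first 12 characters of the network ID (a custom bridge name on the network would break that assumption).

```shell
#!/bin/sh
# Sketch only: helper names are hypothetical, not part of Determined or Docker.

# Assumption: Docker names the Linux bridge of a user-defined bridge
# network "br-" plus the first 12 characters of the network ID.
bridge_iface() {
    printf 'br-%s\n' "$(printf '%s' "$1" | cut -c1-12)"
}

# Print (do not run) the iptables rule that accepts traffic arriving on
# the given interface, so it can be reviewed before being applied.
accept_rule() {
    printf 'iptables -A INPUT -i %s -j ACCEPT\n' "$1"
}

# Example, assuming the cluster network is called determined_default:
#   id=$(docker network ls --quiet --filter name=determined_default)
#   iface=$(bridge_iface "$id")
#   ip link show "$iface"                  # step 2: confirm the interface exists
#   sudo sh -c "$(accept_rule "$iface")"   # step 3: apply the rule
```

Printing the rule instead of running it keeps the helper safe to try on a machine where changing the firewall needs a second look first.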

Let me know if I can help add this information to the FAQ Docs or directly to the installation docs.

Thanks for your precious help.

from determined.

vishnu2kmohan avatar vishnu2kmohan commented on May 22, 2024

Hi @PieroMacaluso can you give us a few more details about the machine on which you've spawned the cluster?

What OS are you running? Mac/Linux?

What's the output of docker network ls, and ip a & ip route if on Linux, or netstat -rn if on Mac?

PieroMacaluso avatar PieroMacaluso commented on May 22, 2024

Hi @vishnu2kmohan and thanks for your reply!

I am running this on: Ubuntu 20.04 LTS

Here are the outputs I get

  • docker network ls
NETWORK ID          NAME                 DRIVER              SCOPE
6289933d84f3        bridge               bridge              local
9f0ff59e2639        determined_default   bridge              local
ebc032543c8a        host                 host                local
e8909223a859        none                 null                local
  • ip a (privacy filter applied)
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host 
       valid_lft forever preferred_lft forever
2: enp2s0: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN group default qlen 1000
    link/ether d4:5d:64:07:a4:41 brd ff:ff:ff:ff:ff:ff
3: eno1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UP group default qlen 1000
    link/ether <privacy-filter> brd ff:ff:ff:ff:ff:ff
    inet <privacy-filter> brd <privacy-filter> scope global eno1
       valid_lft forever preferred_lft forever
    inet6 <privacy-filter> scope link 
       valid_lft forever preferred_lft forever
5: docker0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default 
    link/ether 02:42:52:f6:cc:7c brd ff:ff:ff:ff:ff:ff
    inet 172.17.0.1/16 brd 172.17.255.255 scope global docker0
       valid_lft forever preferred_lft forever
    inet6 fe80::42:52ff:fef6:cc7c/64 scope link 
       valid_lft forever preferred_lft forever
555: vethc4b3d3f@if554: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue master docker0 state UP group default 
    link/ether 2a:0a:05:0d:5e:bb brd ff:ff:ff:ff:ff:ff link-netnsid 3
    inet6 fe80::280a:5ff:fe0d:5ebb/64 scope link 
       valid_lft forever preferred_lft forever
556: br-9f0ff59e2639: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default 
    link/ether 02:42:f1:f6:02:2b brd ff:ff:ff:ff:ff:ff
    inet 192.168.64.1/20 brd 192.168.79.255 scope global br-9f0ff59e2639
       valid_lft forever preferred_lft forever
    inet6 fe80::42:f1ff:fef6:22b/64 scope link 
       valid_lft forever preferred_lft forever
558: vethbb830b8@if557: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue master br-9f0ff59e2639 state UP group default 
    link/ether 4e:9e:f7:4b:ff:0d brd ff:ff:ff:ff:ff:ff link-netnsid 0
    inet6 fe80::4c9e:f7ff:fe4b:ff0d/64 scope link 
       valid_lft forever preferred_lft forever
560: veth497c3f0@if559: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue master br-9f0ff59e2639 state UP group default 
    link/ether 0a:49:fc:ec:65:a5 brd ff:ff:ff:ff:ff:ff link-netnsid 1
    inet6 fe80::849:fcff:feec:65a5/64 scope link 
       valid_lft forever preferred_lft forever
336: veth1bec14a@if335: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue master docker0 state UP group default 
    link/ether be:cd:46:16:23:d1 brd ff:ff:ff:ff:ff:ff link-netnsid 2
    inet6 fe80::bccd:46ff:fe16:23d1/64 scope link 
       valid_lft forever preferred_lft forever
  • ip route (privacy filter applied)
default via <privacy-filter> dev eno1 proto static 
<privacy-filter> dev eno1 proto kernel scope link src <privacy-filter>
172.17.0.0/16 dev docker0 proto kernel scope link src 172.17.0.1 
192.168.64.0/20 dev br-9f0ff59e2639 proto kernel scope link src 192.168.64.1

vishnu2kmohan avatar vishnu2kmohan commented on May 22, 2024

when I try to open the notebook from the browser

@PieroMacaluso how are you trying to open the Notebook from the browser?

You should be able to load the Determined WebUI by pointing your browser to http://localhost:8080 (or to http://<Hostname or IP of your Ubuntu machine>:8080 if the browser is not running on the Ubuntu machine itself). On the Dashboard, which is the default page, you should see an active task card for the Notebook; clicking it should open the Notebook.

Alternatively, you can navigate to the Notebooks section on the left sidebar of the WebUI and click the Open button next to the entry to launch it from there.

Finally, you should also be able to access it directly by visiting https://<Hostname/IP of your Ubuntu machine>:8080/proxy/c91200a8-bae3-4c61-b3e9-07af6d2fc51e/lab/tree/Notebook.ipynb (assuming the UUID of the Notebook that was spawned is still c91200a8-bae3-4c61-b3e9-07af6d2fc51e).

PieroMacaluso avatar PieroMacaluso commented on May 22, 2024

I tried to open the notebooks by using the approaches you described.

If I open the notebook with the Open button, the system stays on the "Redirecting" page for a few seconds.

Then it redirects to https://<Hostname/IP of your Ubuntu machine>:8080/proxy/c91200a8-bae3-4c61-b3e9-07af6d2fc51e/lab/tree/Notebook.ipynb?reset, but it returns ERROR 502.

I also tried the last approach, https://<Hostname/IP of your Ubuntu machine>:8080/proxy/c91200a8-bae3-4c61-b3e9-07af6d2fc51e/lab/tree/Notebook.ipynb, but I still receive ERROR 502.

rb-determined-ai avatar rb-determined-ai commented on May 22, 2024

@PieroMacaluso, thanks for reporting this; it definitely should work.

Would you mind sharing a portion of your master logs that covers the window starting with when the notebook started and includes the moment when you hit the 502 error? You can find them from the Web UI in the lower left-hand corner.

Also, it would be helpful if we could see how the agent was launched. Could you run docker ps to get the container ID of the container running the determined-agent image, and then send the output of the docker exec <the agent container id> env.

Also, if you could share agent logs, found via docker container logs <the agent container id>, that would be excellent.

Finally, if you could share more about the shape of your <Hostname/IP of your Ubuntu machine>, that would help. We have hit bugs in the past where we could handle an address like xxx.xxx.xxx.xxx but where we didn't properly handle [::xxxx:xxxx] or something, and it would be good to confirm this isn't another instance of something like that.

PieroMacaluso avatar PieroMacaluso commented on May 22, 2024

Thanks @rb-determined-ai for your time.

Would you mind sharing a portion of your master logs that covers the window starting with when the notebook started and includes the moment when you hit the 502 error? You can find them from the Web UI in the lower left-hand corner.

2020-07-23, 17:17:00  info    creating notebook  id="notebooks" system="master" type="notebookManager"
2020-07-23, 17:17:00  info    created notebook e2846689-1d3f-4a15-9310-af086b7aa6f8  id="notebooks" system="master" type="notebookManager"
2020-07-23, 17:17:01  info    registering service: e2846689-1d3f-4a15-9310-af086b7aa6f8 (http://130.192.93.60:32817)  id="proxy" system="master" type="Proxy"
2020-07-23, 17:17:04  info    readiness check passed: notebook  id="e2846689-1d3f-4a15-9310-af086b7aa6f8" system="master" type="command"
#### Page hits HTTP Error 502 #### <- this is not part of the log
2020-07-23, 17:17:37  error   error while actor was running  error="websocket: close 1001 (going away)" id="7bfea5eb-37d0-4ad7-bef6-a75559cecd45" system="master" type="websocketActor"
2020-07-23, 17:17:37  error   websocket: close 1001 (going away)
2020-07-23, 17:17:37  error   http: connection has been hijacked

Also, it would be helpful if we could see how the agent was launched. Could you run docker ps to get the container ID of the container running the determined-agent image, and then send the output of the docker exec <the agent container id> env.

PATH=/opt/conda/bin:/usr/local/nvidia/bin:/usr/local/cuda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
HOSTNAME=a183c3940ec8
PYTHONUSERBASE=/run/determined/pythonuserbase
DET_TASK_ID=e2846689-1d3f-4a15-9310-af086b7aa6f8
DET_CLUSTER_ID=8016b3be-f118-412b-88eb-3c68100ae9d2
DET_MASTER_ID=6379c0b9-1a50-4b81-85eb-540bbfcf4e40
DET_MASTER=<privacy-filter>:8080
DET_MASTER_HOST=<privacy-filter>
DET_MASTER_ADDR=<privacy-filter>
DET_MASTER_PORT=8080
DET_AGENT_ID=determined-agent-0
DET_CONTAINER_ID=a9666ffe-7034-4b1c-a4ad-65e6cad5fa25
DET_SLOT_IDS=[3]
DET_USE_GPU=true
CUDA_VERSION=10.0.130
CUDA_PKG_VERSION=10-0=10.0.130-1
LD_LIBRARY_PATH=/usr/local/nvidia/lib:/usr/local/nvidia/lib64
NVIDIA_VISIBLE_DEVICES=all
NVIDIA_DRIVER_CAPABILITIES=compute,utility
NVIDIA_REQUIRE_CUDA=cuda>=10.0 brand=tesla,driver>=384,driver<385 brand=tesla,driver>=410,driver<411
NCCL_VERSION=2.4.8
LIBRARY_PATH=/usr/local/cuda/lib64/stubs
CUDNN_VERSION=7.6.5.32
LANG=C.UTF-8
LC_ALL=C.UTF-8
CONDA_DIR=/opt/conda
PYTHONUNBUFFERED=1
PYTHONFAULTHANDLER=1
PYTHONHASHSEED=0
JUPYTER_CONFIG_DIR=/run/determined/jupyter/config
JUPYTER_DATA_DIR=/run/determined/jupyter/data
JUPYTER_RUNTIME_DIR=/run/determined/jupyter/runtime
HOME=/root

Also, if you could share agent logs, found via docker container logs <the agent container id>, that would be excellent.

Processing /opt/determined/wheels/determined-0.12.11-py3-none-any.whl
Processing /opt/determined/wheels/determined_cli-0.12.11-py3-none-any.whl
Processing /opt/determined/wheels/determined_common-0.12.11-py3-none-any.whl
Requirement already satisfied: psutil in /opt/conda/lib/python3.6/site-packages (from determined==0.12.11) (5.7.0)
Requirement already satisfied: cloudpickle==0.5.3 in /opt/conda/lib/python3.6/site-packages (from determined==0.12.11) (0.5.3)
Requirement already satisfied: lomond==0.3.3 in /opt/conda/lib/python3.6/site-packages (from determined==0.12.11) (0.3.3)
Requirement already satisfied: packaging==19.0 in /opt/conda/lib/python3.6/site-packages (from determined==0.12.11) (19.0)
Requirement already satisfied: numpy>=1.16.2 in /opt/conda/lib/python3.6/site-packages (from determined==0.12.11) (1.18.5)
Requirement already satisfied: simplejson==3.16.0 in /opt/conda/lib/python3.6/site-packages (from determined==0.12.11) (3.16.0)
Requirement already satisfied: matplotlib in /opt/conda/lib/python3.6/site-packages (from determined==0.12.11) (3.2.1)
Requirement already satisfied: requests>=2.20.0 in /opt/conda/lib/python3.6/site-packages (from determined==0.12.11) (2.23.0)
Requirement already satisfied: dill>=0.2.9 in /opt/conda/lib/python3.6/site-packages (from determined==0.12.11) (0.3.1.1)
Requirement already satisfied: boto3>=1.9.220 in /opt/conda/lib/python3.6/site-packages (from determined==0.12.11) (1.14.0)
Collecting yogadl==0.1.1
  Using cached yogadl-0.1.1-py3-none-any.whl (32 kB)
Requirement already satisfied: h5py>=2.9.0 in /opt/conda/lib/python3.6/site-packages (from determined==0.12.11) (2.10.0)
Requirement already satisfied: pyzmq==18.1.0 in /opt/conda/lib/python3.6/site-packages (from determined==0.12.11) (18.1.0)
Requirement already satisfied: gitpython==2.1.11 in /opt/conda/lib/python3.6/site-packages (from determined-cli==0.12.11) (2.1.11)
Requirement already satisfied: python-dateutil==2.8.0 in /opt/conda/lib/python3.6/site-packages (from determined-cli==0.12.11) (2.8.0)
Requirement already satisfied: ruamel.yaml>=0.15.78 in /opt/conda/lib/python3.6/site-packages (from determined-cli==0.12.11) (0.16.10)
Requirement already satisfied: tabulate>=0.8.3 in /opt/conda/lib/python3.6/site-packages (from determined-cli==0.12.11) (0.8.7)
Requirement already satisfied: termcolor==1.1.0 in /opt/conda/lib/python3.6/site-packages (from determined-cli==0.12.11) (1.1.0)
Requirement already satisfied: argcomplete==1.9.4 in /opt/conda/lib/python3.6/site-packages (from determined-cli==0.12.11) (1.9.4)
Requirement already satisfied: hdfs>=2.2.2 in /opt/conda/lib/python3.6/site-packages (from determined-common==0.12.11) (2.5.8)
Requirement already satisfied: pathspec>=0.6.0 in /opt/conda/lib/python3.6/site-packages (from determined-common==0.12.11) (0.8.0)
Requirement already satisfied: google-cloud-storage>=1.20.0 in /opt/conda/lib/python3.6/site-packages (from determined-common==0.12.11) (1.28.1)
Requirement already satisfied: six>=1.10.0 in /opt/conda/lib/python3.6/site-packages (from lomond==0.3.3->determined==0.12.11) (1.15.0)
Requirement already satisfied: pyparsing>=2.0.2 in /opt/conda/lib/python3.6/site-packages (from packaging==19.0->determined==0.12.11) (2.4.7)
Requirement already satisfied: cycler>=0.10 in /opt/conda/lib/python3.6/site-packages (from matplotlib->determined==0.12.11) (0.10.0)
Requirement already satisfied: kiwisolver>=1.0.1 in /opt/conda/lib/python3.6/site-packages (from matplotlib->determined==0.12.11) (1.2.0)
Requirement already satisfied: certifi>=2017.4.17 in /opt/conda/lib/python3.6/site-packages (from requests>=2.20.0->determined==0.12.11) (2020.4.5.1)
Requirement already satisfied: idna<3,>=2.5 in /opt/conda/lib/python3.6/site-packages (from requests>=2.20.0->determined==0.12.11) (2.9)
Requirement already satisfied: urllib3!=1.25.0,!=1.25.1,<1.26,>=1.21.1 in /opt/conda/lib/python3.6/site-packages (from requests>=2.20.0->determined==0.12.11) (1.25.8)
Requirement already satisfied: chardet<4,>=3.0.2 in /opt/conda/lib/python3.6/site-packages (from requests>=2.20.0->determined==0.12.11) (3.0.4)
Requirement already satisfied: s3transfer<0.4.0,>=0.3.0 in /opt/conda/lib/python3.6/site-packages (from boto3>=1.9.220->determined==0.12.11) (0.3.3)
Requirement already satisfied: jmespath<1.0.0,>=0.7.1 in /opt/conda/lib/python3.6/site-packages (from boto3>=1.9.220->determined==0.12.11) (0.10.0)
Requirement already satisfied: botocore<1.18.0,>=1.17.0 in /opt/conda/lib/python3.6/site-packages (from boto3>=1.9.220->determined==0.12.11) (1.17.0)
Requirement already satisfied: websockets in /opt/conda/lib/python3.6/site-packages (from yogadl==0.1.1->determined==0.12.11) (8.1)
Requirement already satisfied: async-generator in /opt/conda/lib/python3.6/site-packages (from yogadl==0.1.1->determined==0.12.11) (1.10)
Requirement already satisfied: lmdb in /opt/conda/lib/python3.6/site-packages (from yogadl==0.1.1->determined==0.12.11) (0.98)
Requirement already satisfied: filelock in /opt/conda/lib/python3.6/site-packages (from yogadl==0.1.1->determined==0.12.11) (3.0.12)
Requirement already satisfied: gitdb2>=2.0.0 in /opt/conda/lib/python3.6/site-packages (from gitpython==2.1.11->determined-cli==0.12.11) (4.0.2)
Requirement already satisfied: ruamel.yaml.clib>=0.1.2; platform_python_implementation == "CPython" and python_version < "3.9" in /opt/conda/lib/python3.6/site-packages (from ruamel.yaml>=0.15.78->determined-cli==0.12.11) (0.2.0)
Requirement already satisfied: docopt in /opt/conda/lib/python3.6/site-packages (from hdfs>=2.2.2->determined-common==0.12.11) (0.6.2)
Requirement already satisfied: google-cloud-core<2.0dev,>=1.2.0 in /opt/conda/lib/python3.6/site-packages (from google-cloud-storage>=1.20.0->determined-common==0.12.11) (1.3.0)
Requirement already satisfied: google-resumable-media<0.6dev,>=0.5.0 in /opt/conda/lib/python3.6/site-packages (from google-cloud-storage>=1.20.0->determined-common==0.12.11) (0.5.1)
Requirement already satisfied: google-auth<2.0dev,>=1.11.0 in /opt/conda/lib/python3.6/site-packages (from google-cloud-storage>=1.20.0->determined-common==0.12.11) (1.16.1)
Requirement already satisfied: docutils<0.16,>=0.10 in /opt/conda/lib/python3.6/site-packages (from botocore<1.18.0,>=1.17.0->boto3>=1.9.220->determined==0.12.11) (0.15.2)
Requirement already satisfied: gitdb>=4.0.1 in /opt/conda/lib/python3.6/site-packages (from gitdb2>=2.0.0->gitpython==2.1.11->determined-cli==0.12.11) (4.0.5)
Requirement already satisfied: google-api-core<2.0.0dev,>=1.16.0 in /opt/conda/lib/python3.6/site-packages (from google-cloud-core<2.0dev,>=1.2.0->google-cloud-storage>=1.20.0->determined-common==0.12.11) (1.20.0)
Requirement already satisfied: pyasn1-modules>=0.2.1 in /opt/conda/lib/python3.6/site-packages (from google-auth<2.0dev,>=1.11.0->google-cloud-storage>=1.20.0->determined-common==0.12.11) (0.2.8)
Requirement already satisfied: cachetools<5.0,>=2.0.0 in /opt/conda/lib/python3.6/site-packages (from google-auth<2.0dev,>=1.11.0->google-cloud-storage>=1.20.0->determined-common==0.12.11) (4.1.0)
Requirement already satisfied: rsa<4.1,>=3.1.4 in /opt/conda/lib/python3.6/site-packages (from google-auth<2.0dev,>=1.11.0->google-cloud-storage>=1.20.0->determined-common==0.12.11) (4.0)
Requirement already satisfied: setuptools>=40.3.0 in /opt/conda/lib/python3.6/site-packages (from google-auth<2.0dev,>=1.11.0->google-cloud-storage>=1.20.0->determined-common==0.12.11) (47.1.1.post20200604)
Requirement already satisfied: smmap<4,>=3.0.1 in /opt/conda/lib/python3.6/site-packages (from gitdb>=4.0.1->gitdb2>=2.0.0->gitpython==2.1.11->determined-cli==0.12.11) (3.0.4)
Requirement already satisfied: googleapis-common-protos<2.0dev,>=1.6.0 in /opt/conda/lib/python3.6/site-packages (from google-api-core<2.0.0dev,>=1.16.0->google-cloud-core<2.0dev,>=1.2.0->google-cloud-storage>=1.20.0->determined-common==0.12.11) (1.52.0)
Requirement already satisfied: pytz in /opt/conda/lib/python3.6/site-packages (from google-api-core<2.0.0dev,>=1.16.0->google-cloud-core<2.0dev,>=1.2.0->google-cloud-storage>=1.20.0->determined-common==0.12.11) (2020.1)
Requirement already satisfied: protobuf>=3.4.0 in /opt/conda/lib/python3.6/site-packages (from google-api-core<2.0.0dev,>=1.16.0->google-cloud-core<2.0dev,>=1.2.0->google-cloud-storage>=1.20.0->determined-common==0.12.11) (3.12.2)
Requirement already satisfied: pyasn1<0.5.0,>=0.4.6 in /opt/conda/lib/python3.6/site-packages (from pyasn1-modules>=0.2.1->google-auth<2.0dev,>=1.11.0->google-cloud-storage>=1.20.0->determined-common==0.12.11) (0.4.8)
Installing collected packages: determined-common, yogadl, determined, determined-cli
Successfully installed determined-0.12.11 determined-cli-0.12.11 determined-common-0.12.11 yogadl-0.1.1
[I 15:17:04.131 LabApp] Writing notebook server cookie secret to /run/determined/jupyter/runtime/notebook_cookie_secret
[W 15:17:04.446 LabApp] All authentication is disabled.  Anyone who can connect to this server will be able to run code.
[I 15:17:04.453 LabApp] JupyterLab extension loaded from /opt/conda/lib/python3.6/site-packages/jupyterlab
[I 15:17:04.453 LabApp] JupyterLab application directory is /opt/conda/share/jupyter/lab
[I 15:17:04.455 LabApp] Serving notebooks from local directory: /run/determined/workdir
[I 15:17:04.455 LabApp] The Jupyter Notebook is running at:
[I 15:17:04.455 LabApp] http://a183c3940ec8:8888/proxy/e2846689-1d3f-4a15-9310-af086b7aa6f8/
[I 15:17:04.455 LabApp] Use Control-C to stop this server and shut down all kernels (twice to skip confirmation).

Finally, if you could share more about the shape of your <Hostname/IP of your Ubuntu machine>, that would help. We have hit bugs in the past where we could handle an address like xxx.xxx.xxx.xxx but where we didn't properly handle [::xxxx:xxxx] or something, and it would be good to confirm this isn't another instance of something like that.

The IP shape of my Ubuntu machine is XXX.XXX.XX.XX

It seems to me that the link http://XXX.XXX.XX.XX/proxy/e2846689-1d3f-4a15-9310-af086b7aa6f8/ does not properly redirect the connection to http://XXX.XXX.XX.XX:32817/proxy/e2846689-1d3f-4a15-9310-af086b7aa6f8/. If I put the latter link in the browser, it works! However, I noticed that the connection problem also occurs when using the Native API.

rb-determined-ai avatar rb-determined-ai commented on May 22, 2024

I spent some time today trying to reproduce this on a fresh Ubuntu 20.04 machine, but no luck.

Ok, so since you are able to connect to the notebook by visiting http://XXX.XXX.XX.XX:32817/proxy/e2846689-1d3f-4a15-9310-af086b7aa6f8, we know that the notebook is running and the port is accessible.

We also know, based on the registering service: ... line from the master logs, that the master is going to proxy connections for the UUID on that line to the address inside the parentheses on that line, which looks like it should absolutely be the same http://XXX.XXX.XX.XX:32817 that we just manually visited. But somehow that is failing, which is why we get a 502 error (although it is odd that there's no error message in the master logs).
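
To read the proxy target straight from the master logs, a small filter like the following might help. It is only a sketch: proxy_targets is a hypothetical helper name, and the pattern assumes the registering service log line keeps the format shown earlier in this thread.

```shell
#!/bin/sh
# Hypothetical helper: read master log lines on stdin and print
# "<task UUID> <proxy target>" for each "registering service" entry.
# Assumes the format: registering service: <uuid> (<url>)
proxy_targets() {
    sed -n 's/.*registering service: \([0-9a-f-]*\) (\([^)]*\)).*/\1 \2/p'
}

# Example: paste the master logs from the WebUI into a file, then run
#   proxy_targets < master.log
```

The second field is the address the master will try to reach, so it is the one worth testing from inside the master container.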

The master is inside a container, so I would normally suspect name resolution or routing issues. However, if every time that you inserted <privacy-filter> it was the same public XXX.XXX.XX.XX ip address, then neither resolution nor routing could be a problem. (sidenote: were all instances of <privacy-filter> the same public ip address?)

Then the only thing I can think of that is causing the problem would be if there was some firewall rule that was disallowing docker containers from talking to open ports on the host or something like that. That is the only explanation I can come up with for why you would be able to connect to port 32817 externally but the master would fail to connect to the same port.

That's easy enough to check... Could you identify the docker container for the determinedai/determined-master image, then do something like this:

docker exec CONTAINER_ID apt update
docker exec CONTAINER_ID apt install curl -y
docker exec CONTAINER_ID curl -D - http://XXX.XXX.XX.XX:32817/proxy/e2846689-1d3f-4a15-9310-af086b7aa6f8/lab/tree/Notebook.ipynb
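
If the curl call hangs instead of failing, a time limit makes the result easier to read. The following is a rough sketch under stated assumptions: notebook_url and probe_from_container are hypothetical helper names, the 10 second limit is arbitrary, and curl is assumed to be installed in the container already.

```shell
#!/bin/sh
# Build the URL the master would proxy to; host, port and UUID are
# placeholders to be replaced with the values from your own cluster.
notebook_url() {
    printf 'http://%s:%s/proxy/%s/\n' "$1" "$2" "$3"
}

# Run curl inside the given container, give up after 10 seconds, and
# print only the HTTP status code. curl prints 000 when no response
# arrives, for example when packets are silently dropped by a firewall.
probe_from_container() {
    docker exec "$1" curl -s -o /dev/null -w '%{http_code}\n' --max-time 10 "$2"
}

# Example:
#   probe_from_container CONTAINER_ID "$(notebook_url XXX.XXX.XX.XX 32817 <notebook UUID>)"
```

A 000 result after the timeout, combined with the same URL working from a browser outside the container, would point toward filtering between the container and the host.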

Also, just for good measure, would you mind sending the agent environment and the agent container logs? In your last response you sent the environment and the container logs for the notebook container (which was also helpful). The agent docker container should be the one running the determinedai/determined-agent:0.12.11 image.

Thank you for your patience with all of this.

PieroMacaluso avatar PieroMacaluso commented on May 22, 2024

Thanks for your patience and kindness @rb-determined-ai! Here is what I have found.

The master is inside a container, so I would normally suspect name resolution or routing issues. However, if every time that you inserted <privacy-filter> it was the same public XXX.XXX.XX.XX ip address, then neither resolution nor routing could be a problem. (sidenote: were all instances of <privacy-filter> the same public ip address?)

I always used the same public IP address.

docker exec CONTAINER_ID apt update
docker exec CONTAINER_ID apt install curl -y
docker exec CONTAINER_ID curl -D - http://XXX.XXX.XX.XX:32817/proxy/e2846689-1d3f-4a15-9310-af086b7aa6f8/lab/tree/Notebook.ipynb

I executed these commands on the machine (substituting the correct values) and the curl command does not receive an answer. I left it running for 2 minutes.

Also, just for good measure, would you mind sending the agent environment and the agent container logs?

Agent Environment: determinedai/determined-agent:0.12.11
Output of Agent Logs from the start of a new notebook:

INFO[2020-07-27T10:09:41Z] transitioning state from ASSIGNED to PULLING  id=17673215-55e8-495a-9f65-439433bd3afc system=determined-agent-0 type=containerActor
WARN[2020-07-27T10:09:41Z] can't retrieve any credential stores: can't open docker config: open /root/.docker/config.json: no such file or directory  id=docker system=determined-agent-0 type=dockerActor
INFO[2020-07-27T10:09:41Z] transitioning state from PULLING to STARTING  id=17673215-55e8-495a-9f65-439433bd3afc system=determined-agent-0 type=containerActor
INFO[2020-07-27T10:09:42Z] transitioning state from STARTING to RUNNING  id=17673215-55e8-495a-9f65-439433bd3afc system=determined-agent-0 type=containerActor

(The IDs in this part are different because I had to restart the cluster)

PieroMacaluso avatar PieroMacaluso commented on May 22, 2024

curl is working with http://488f9e1cafe7:8888/proxy/bf8271a4-04b3-4c3c-bc3e-ba3cbe01c9d5/lab/tree/Notebook.ipynb

rb-determined-ai avatar rb-determined-ai commented on May 22, 2024

Ok, cool. This narrows the issue down quite a bit. The fact that this works

curl http://488f9e1cafe7:8888/proxy/bf8271a4-04b3-4c3c-bc3e-ba3cbe01c9d5/lab/tree/Notebook.ipynb

but this doesn't

curl -D - http://XXX.XXX.XX.XX:32817/proxy/e2846689-1d3f-4a15-9310-af086b7aa6f8/lab/tree/Notebook.ipynb

confirms that the issue is a firewall issue. Because our system is designed to operate on a multi-machine cluster, we would not try to use docker to deal with bridged networking; i.e. we would never try to contact the notebook by trying to resolve the 488f9e1cafe7:8888 hostname. Instead, we configure the agent to expose a port from the container which it launches, and the master will reach out to the agent's IP address, at that port, which will get forwarded to the task container. It would look like this:

incoming to    proxied to
 master IP      agent IP
----------+   +----------+
          |   |          |
       ___v___|___    ___v______________
      | |       | |  | | |            | |
      | | proxy | |  | | port forward | |
      | |_______| |  | |_|____________| |
      |           |  | | v            | |
      |   master  |  | |   notebook   | |
      |___________|  | |______________| |
                     |                  |
                     |       agent      |
                     |__________________|

The whole reason for this proxying layer is that we know that the master is going to be accessible externally, but there is no guarantee that the agent is externally accessible; only that the agent is accessible from the master.

So this should shed some light on why det-deploy local configures the agent and master via the external IP address of your machine; it allows both the agent and the master to think of each other as being on different machines in a way that is transparent to our system.

Now, we know that because this does not work

curl -D - http://XXX.XXX.XX.XX:32817/proxy/e2846689-1d3f-4a15-9310-af086b7aa6f8/lab/tree/Notebook.ipynb

that you either have a firewall rule blocking incoming connections to your machine from docker, or that you have a DROP or REJECT policy and you have not explicitly allowed incoming connections from docker.

If it is the second case, you might be able to enable the relevant connections via:

sudo iptables -A INPUT -i docker0 -j ACCEPT
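
Before adding the rule, it may be worth checking which case applies. The sketch below is a suggestion only: input_policy and ensure_accept are hypothetical helper names, and the interface name passed in (docker0 here, or a br-<ID> bridge) depends on which Docker network the master container is attached to.

```shell
#!/bin/sh
# Read the output of `iptables -S INPUT` on stdin and print the chain
# policy (the target on the "-P INPUT ..." line).
input_policy() {
    awk '$1 == "-P" && $2 == "INPUT" { print $3 }'
}

# Append the ACCEPT rule for an interface only if an identical rule is
# not already present. Must run as root; -C checks for an existing rule.
ensure_accept() {
    iptables -C INPUT -i "$1" -j ACCEPT 2>/dev/null ||
        iptables -A INPUT -i "$1" -j ACCEPT
}

# Example (as root):
#   iptables -S INPUT | input_policy   # prints the INPUT chain policy
#   ensure_accept docker0
```

Using the check first avoids piling up duplicate rules if the command is run more than once.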

rb-determined-ai avatar rb-determined-ai commented on May 22, 2024

@PieroMacaluso Sorry for the delayed response, but if you wanted to contribute this to the FAQ Docs, that would be fantastic! We had another community member on our community slack hit the same issue actually.

PieroMacaluso avatar PieroMacaluso commented on May 22, 2024

I will soon contribute to the FAQ Docs! Thanks for your time and help! 👍
