
Comments (24)

jdmarble commented on June 11, 2024

I'm seeing the same issue running in Kubernetes. Might be related to this bug in Alpine.
Edit: scratch that. I rebuilt using node:12-alpine3.10 and still had the problem.

jdmarble commented on June 11, 2024

I ported to node:12-slim, which successfully works around the problem. I've been running into a lot of DNS issues with Alpine-based images. Not sure if it's my k8s cluster's configuration, or what.

jdmarble commented on June 11, 2024

I've resolved the DNS issue I've been having while running this and other Alpine-based images in Kubernetes clusters on my network.

Short answer: I turned off DNSSEC for my domain name managed by Cloudflare and everything started working.

Read on for details.

Some information about my setup:

  • I use Cloudflare DNS to set up DNS TXT records for Let's Encrypt so that my internal-only servers can serve browser-trusted certificates.
  • I don't use Cloudflare DNS for normal (A, AAAA, etc...) DNS records for my internal domain. I have an internal, Unbound DNS service for that.
  • Crucially, I had DNSSEC enabled for my internal domain in the Cloudflare DNS settings. I must have enabled it when I had different plans for that domain.

Some general information about what causes the problem for me (and possibly for you):

  • When Kubernetes starts a container, it adds search domains and options ndots:5 to /etc/resolv.conf inside the container
    • It copies the search domains from the host (my local domain, say, mylocaldomain.tld in my case) and adds a bunch of Kubernetes-specific ones like cluster.local and svc.cluster.local.
    • This resolv.conf configuration has to do with looking up local services inside the cluster.
    • Aside: you can also override ndots to be "1" in each pod spec to solve the problem in another way (see the sketch after this list)
  • Now, when a DNS lookup for, say, foundryvtt.com is performed inside of a container, all of those search domains are checked first. For example, foundryvtt.com.svc.cluster.local then foundryvtt.com.cluster.local and foundryvtt.com.mylocaldomain.tld. Finally, if none of those other domains "resolve", then foundryvtt.com is checked.
    • The ...cluster.local domains are rejected by CoreDNS inside of the cluster, I guess. No beef with those.
    • foundryvtt.com.mylocaldomain.tld escapes the cluster and gets to my internal Unbound DNS server.
    • Unbound doesn't recognize it, so passes it, transparently, to another DNS server (8.8.8.8, Google's public DNS in my case).
      • Maybe I should configure Unbound to reject anything with that base domain that it doesn't recognize?
    • That DNS server recognizes the mylocaldomain.tld part and asks Cloudflare how to resolve it because Cloudflare is the authority on that particular domain.
    • Cloudflare would normally respond with NXDOMAIN, which, I guess (not a DNS expert here) means "doesn't exist". Instead, because I had DNSSEC enabled, it responds with NOERROR, but doesn't respond with an actual IP address. This is something like "I can neither confirm nor deny the existence of that or related domains". Read here about how Cloudflare justifies that response.
    • That "no comment" response winds its way back to the original requestor. Any non-musl-based DNS client library would then shrug and continue looking through the search domains until it got to the implied '.' and tried 'foundryvtt.com' with a happy ending. musl will stop looking after receiving a NOERROR. Read here about how musl justifies that response.
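
As an aside to the ndots note in the list above, here is a minimal sketch of what that per-pod override could look like (pod and container names are illustrative; dnsConfig.options is a standard Kubernetes field):

apiVersion: v1
kind: Pod
metadata:
  name: foundryvtt
spec:
  containers:
    - name: foundryvtt
      image: felddy/foundryvtt:latest
  dnsConfig:
    options:
      # Override the ndots:5 that Kubernetes injects, so a name containing
      # a dot (like foundryvtt.com) is tried as-is before any search domains.
      - name: ndots
        value: "1"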

I could verify that this was a problem and that my fix worked using alpine/git and dig.

Before fix:

[jdmarble@jdmarble-desktop ~]$ kubectl run alpine-git --image=alpine/git --restart=Never -it --rm clone https://github.com/octocat/Spoon-Knife.git
fatal: unable to access 'https://github.com/octocat/Spoon-Knife.git/': Could not resolve host: github.com
...

(note that github.com did not resolve inside an Alpine-based container inside the cluster)

[jdmarble@jdmarble-desktop ~]$ dig github.com.mylocaldomain.tld
...
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 26637
...
;; AUTHORITY SECTION:
mylocaldomain.tld.		1720	IN	SOA	cleo.ns.cloudflare.com. dns.cloudflare.com. ...
...

(note the NOERROR response)

After fix:

[jdmarble@jdmarble-desktop ~]$ kubectl run alpine-git --image=alpine/git --restart=Never -it --rm clone https://github.com/octocat/Spoon-Knife.git
Cloning into 'Spoon-Knife'...
remote: Enumerating objects: 16, done.
remote: Total 16 (delta 0), reused 0 (delta 0), pack-reused 16
Receiving objects: 100% (16/16), done.
Resolving deltas: 100% (3/3), done.

(note that github.com resolved inside an Alpine-based container inside the cluster)

[jdmarble@jdmarble-desktop ~]$ dig github.com.mylocaldomain.tld
...
;; ->>HEADER<<- opcode: QUERY, status: NXDOMAIN, id: 56469
...
;; AUTHORITY SECTION:
mylocaldomain.tld.		1044	IN	SOA	cleo.ns.cloudflare.com. dns.cloudflare.com. ...
...

(note the NXDOMAIN response)

In my case, it was an easy decision to disable DNSSEC because the domain is only used internally and I'm not using Cloudflare for normal records. If you want to keep DNSSEC on, you may have to get creative or switch away from Cloudflare.
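
One quick way to double-check that DNSSEC is really off for a zone is a DS lookup at the parent (domain name illustrative):

$ dig +short DS mylocaldomain.tld

An empty answer means no signed delegation is published, i.e. DNSSEC is effectively disabled for the zone; a DS record in the answer means the chain of trust is still in place.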

annonch commented on June 11, 2024

I also have this networking issue in my k3s cluster. @jdmarble's repo worked :D

adam8797 commented on June 11, 2024

Sure thing. I'll test it this evening (or possibly tomorrow if I run out of time) and I'll post back here

annonch commented on June 11, 2024

In case this is helpful.
I noticed that the felddy/foundryvtt:improvement-debian image worked fine; however, the following errors appear with felddy/foundryvtt:latest:

Entrypoint | 2021-03-16 16:15:16 | [debug] Timezone set to: UTC
Entrypoint | 2021-03-16 16:15:16 | [info] Starting felddy/foundryvtt container v0.7.9
Entrypoint | 2021-03-16 16:15:16 | [debug] CONTAINER_VERBOSE set.  Debug logging enabled.
Entrypoint | 2021-03-16 16:15:16 | [info] No Foundry Virtual Tabletop installation detected.
Entrypoint | 2021-03-16 16:15:16 | [info] Using FOUNDRY_USERNAME and FOUNDRY_PASSWORD to authenticate.
Authenticate | 2021-03-16 16:15:16 | [debug] Saving cookies to: cookiejar.json
Authenticate | 2021-03-16 16:15:16 | [info] Requesting CSRF tokens from https://foundryvtt.com
Authenticate | 2021-03-16 16:15:16 | [debug] Fetching: https://foundryvtt.com
Authenticate | 2021-03-16 16:15:16 | [error] Unable to authenticate: request to https://foundryvtt.com/ failed, reason: getaddrinfo ENOTFOUND foundryvtt.com

Results Locally

Unable to find image 'node:14-alpine' locally
14-alpine: Pulling from library/node
e95f33c60a64: Pull complete 
0f691a8bb887: Pull complete 
daf9b71c0a0d: Pull complete 
d92a928c7b7d: Pull complete 
Digest: sha256:a75f7cc536062f9266f602d49047bc249826581406f8bc5a6605c76f9ed18e98
Status: Downloaded newer image for node:14-alpine
Server:         8.8.8.8
Address:        8.8.8.8:53

Non-authoritative answer:
Name:   foundryvtt.com
Address: 44.234.61.225

Non-authoritative answer:

Inside k3s (YAML included; this also worked when setting the DNS server to 8.8.8.8):

apiVersion: batch/v1
kind: Job
metadata:
  name: hello
spec:
  template:
    # This is the pod template
    spec:
      containers:
      - name: dns-test
        image: node:14-alpine
        command: ['nslookup', 'foundryvtt.com']
      restartPolicy: OnFailure
---
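
The lookup results below can be retrieved from the completed Job (assuming the Job name above):

kubectl logs job/hello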

Server:         10.43.0.10
Address:        10.43.0.10:53

Non-authoritative answer:

Non-authoritative answer:
Name:   foundryvtt.com
Address: 44.234.61.225

BitRacer commented on June 11, 2024

I have not been able to fix this yet, but I suspect this may be an issue with CoreDNS.

Lookups for foundryvtt.com appear to be failing because passthrough does not seem to be working.

From the CoreDNS logs:

[INFO] 10.1.182.28:51321 - 64102 "A IN foundryvtt.com.svc.cluster.local. udp 50 false 512" NXDOMAIN qr,aa,rd 143 0.000390493s
[INFO] 10.1.182.28:51321 - 41623 "A IN foundryvtt.com.cluster.local. udp 46 false 512" NXDOMAIN qr,aa,rd 139 0.000535954s
[INFO] 10.1.182.28:51321 - 17998 "A IN foundryvtt.com.local. udp 38 false 512" SERVFAIL qr,rd,ra 113 0.03611267s

No lookups for plain foundryvtt.com, though.

hugoprudente commented on June 11, 2024

I'll test this again on my 3 k8s clusters with the Alpine image (my default), and update here and in the other thread too. I still have the 8.8.8.8 on my CoreDNS, so I'll try both and edit this post.

My 3 clusters currently run this K8s version:

Client Version: version.Info{Major:"1", Minor:"22", GitVersion:"v1.22.1", GitCommit:"632ed300f2c34f6d6d15ca4cef3d3c7073412212", GitTreeState:"clean", BuildDate:"2021-08-19T15:45:37Z", GoVersion:"go1.16.7", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"22", GitVersion:"v1.22.0", GitCommit:"c2b5237ccd9c0f1d600d3072634ca66cefdf272f", GitTreeState:"clean", BuildDate:"2021-08-04T20:01:24Z", GoVersion:"go1.16.6", Compiler:"gc", Platform:"linux/amd64"}

Running CoreDNS k8s.gcr.io/coredns/coredns:v1.8.4:

➜ k describe replicaset coredns-78fcd69978 -n kube-system
Name:           coredns-78fcd69978
Namespace:      kube-system
Selector:       k8s-app=kube-dns,pod-template-hash=78fcd69978
Labels:         k8s-app=kube-dns
                pod-template-hash=78fcd69978
Annotations:    deployment.kubernetes.io/desired-replicas: 2
                deployment.kubernetes.io/max-replicas: 3
                deployment.kubernetes.io/revision: 1
Controlled By:  Deployment/coredns
Replicas:       2 current / 2 desired
Pods Status:    2 Running / 0 Waiting / 0 Succeeded / 0 Failed
Pod Template:
  Labels:           k8s-app=kube-dns
                    pod-template-hash=78fcd69978
  Service Account:  coredns
  Containers:
   coredns:
    Image:       k8s.gcr.io/coredns/coredns:v1.8.4
    Ports:       53/UDP, 53/TCP, 9153/TCP
    Host Ports:  0/UDP, 0/TCP, 0/TCP
    Args:
      -conf
      /etc/coredns/Corefile
    Limits:
      memory:  170Mi
    Requests:
      cpu:        100m
      memory:     70Mi
    Liveness:     http-get http://:8080/health delay=60s timeout=5s period=10s #success=1 #failure=5
    Readiness:    http-get http://:8181/ready delay=0s timeout=1s period=10s #success=1 #failure=3
    Environment:  <none>
    Mounts:
      /etc/coredns from config-volume (ro)
  Volumes:
   config-volume:
    Type:               ConfigMap (a volume populated by a ConfigMap)
    Name:               coredns
    Optional:           false
  Priority Class Name:  system-cluster-critical
Events:                 <none>

Confirmed with the same error:

Authenticate | 2022-01-24 19:52:07 | [error] Unable to authenticate: request to https://foundryvtt.com/auth/login/ failed, reason: getaddrinfo EAI_AGAIN foundryvtt.com

I have found something interesting that may solve the issue. From the Node.js documentation (https://nodejs.org/api/cli.html#cli_uv_threadpool_size_size):

"Though the call to dns.lookup() will be asynchronous from JavaScript's perspective, it is implemented as a synchronous call to getaddrinfo(3) that runs on libuv's threadpool. This can have surprising negative performance implications for some applications, see the UV_THREADPOOL_SIZE documentation for more information."

More here: https://medium.com/@amirilovic/how-to-fix-node-dns-issues-5d4ec2e12e95

This solved my issue running 200 deployments.
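
For anyone who wants to try the same fix, a minimal sketch of setting that variable on the container (the variable name is from the Node.js docs quoted above; the container details and value are illustrative):

spec:
  containers:
    - name: foundryvtt
      image: felddy/foundryvtt:latest
      env:
        # Grow libuv's threadpool so the synchronous getaddrinfo(3) calls
        # behind dns.lookup() are less likely to starve. The default is 4.
        - name: UV_THREADPOOL_SIZE
          value: "8"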

felddy commented on June 11, 2024

Thanks for the research on this. I'm not entirely against switching the base image from Alpine to Debian. I'd like to give upstream a bit of time to resolve this before jumping ship.

@jdmarble what was the impact to the image size using Debian-slim?

jdmarble commented on June 11, 2024

I expected the Debian (even slim) based image to be larger than the Alpine one. I was surprised, although I'm not sure I can trust the results because I don't understand them. I'm getting different numbers depending on the source.

$ podman image ls
REPOSITORY                                      TAG            IMAGE ID      CREATED      SIZE
registry.gitlab.com/jdmarble/foundryvtt-docker  develop        6ad53b690aeb  3 days ago   106 MB
docker.io/felddy/foundryvtt                     latest         e3706094d2a7  2 weeks ago  111 MB

The Gitlab repo reports for my "slim" spin: 32.56 MiB (edit: 34.14MB)
Your image size badge reports: 34 MB
Docker Hub reports the compressed image size for felddy/foundryvtt as 33.92 MB.
I tried pushing my image to Docker Hub to get an apples-to-apples comparison, but it's taking a while to show up.

Maybe podman is reporting uncompressed size?

Regardless, I wouldn't suggest something as drastic as a base image change only to fix this type of problem, but a slightly smaller image size is interesting (if it's true). :)

adam8797 commented on June 11, 2024

I think I'm being affected by this issue too, but in the weirdest way I could imagine. I've spent the last 4 hours debugging and searching lol. I'm spinning this up in Kubernetes.

It started when I got errors about rejected certs during the download process. I managed to get a shell into a container, and voilà!

(all four commands ran in quick succession)
[screenshot: output of the four curl commands]

The 404s are from my own public-facing Traefik instance, and then it eventually curls correctly, at random. The next request was back to the 404s.

I'm going to try building the image myself from different bases like @jdmarble did, but this is just an impact report I guess

Edit: Bless you, jdmarble, you forked and pushed your port. May the coding gods smile upon you.

adam8797 commented on June 11, 2024

Update: Looks like that was unsuccessful. I was able to build the image successfully, but I still have the same problem. Sorry for the noise. Considering this may be unrelated, I can move my information to another ticket if you prefer.

felddy commented on June 11, 2024

I had hoped upstream would have fixed this issue in busybox, but that doesn't seem to be happening. Also, this is starting to affect more people.

I have started a branch using the node:14-slim base image:
https://github.com/felddy/foundryvtt-docker/tree/improvement/debian

I'm a little concerned about the size increase (but it is not a show stopper):

❱ docker images | grep foundry
felddy/foundryvtt                       0.7.9-slim        ce29f9a2bc03   44 minutes ago      195MB
felddy/foundryvtt                       0.8.0             f676a803cfcb   3 weeks ago         126MB
felddy/foundryvtt                       release           e3706094d2a7   2 months ago        103MB
felddy/foundryvtt                       release-0.7.9     38a78b0459a4   2 months ago        103MB

The bigger issue that I need to resolve is that only half of the architectures supported by Alpine are offered by Debian:

os/arch          node:14-alpine   node:14-slim
linux/amd64      ✓                ✓
linux/arm/v6     ✓
linux/arm/v7     ✓                ✓
linux/arm64/v8   ✓                ✓
linux/ppc64le    ✓
linux/s390x      ✓
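
For reference, the platforms offered for a given tag can be checked with something like this (assuming a Docker CLI with manifest inspection available):

docker manifest inspect node:14-slim | grep architecture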

I don't have any idea how many users this would impact. I'd guess that loss of arm/v6 would be the biggest impact. I know a good number of people run Foundry on Raspberry Pis and this would remove support for the RPi 1 B and RPi 1 B+.

In any case, if you'd like to test the image from this branch it is available to be pulled as felddy/foundryvtt:improvement-debian. I would appreciate any feedback from the folks on this issue since I don't have a K8s cluster readily available.

If you have any comments about the limited architectures, that would also be helpful.

felddy commented on June 11, 2024

Could I also get folks to try running this and post the results? I'm unable to reproduce the behavior here, and want to verify that it hasn't been fixed upstream:

❱ docker run -it --rm --dns 8.8.8.8 node:14-alpine nslookup foundryvtt.com
Server:		8.8.8.8
Address:	8.8.8.8:53

Non-authoritative answer:

Non-authoritative answer:
Name:	foundryvtt.com
Address: 44.234.61.225

felddy commented on June 11, 2024

@annonch Those are promising results.

When you get a chance, could you check whether the nightly build exhibits the same behavior as the last release: felddy/foundryvtt:nightly

If node:14-alpine is working, I'd expect that felddy/foundryvtt:nightly should work as well.

🤞

adam8797 commented on June 11, 2024

Unfortunately I can't provide such good results. I'm running these in Kubernetes.

I just curled the Foundry website to test resolution. Here I used wc to condense the output: the 4-word result is the bad DNS resolution, the 698-word result is the proper web page.

I tried this against improvement-debian but the behavior is still there:

[screenshot: weird_dns_behavior_1]

And against nightly it was all 4s. I didn't get a single good hit to the Foundry website.

Now, if I'm the only one here, I'm willing to concede that it's just my setup, this may be unrelated, and I'm just making noise 😆

I can work around it by setting my DNS policy to None and manually assigning DNS servers, along the lines of the sketch below.
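
A minimal sketch of that workaround in a pod spec (the nameserver choice is illustrative; dnsPolicy and dnsConfig are standard fields):

spec:
  dnsPolicy: "None"      # ignore the cluster-generated resolv.conf entirely
  dnsConfig:
    nameservers:
      - 8.8.8.8          # no search domains, so no ndots expansion happens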

hugoprudente commented on June 11, 2024

@adam8797, how are you running your K8s? I never had an issue with DNS resolution using the Alpine container.

I did 1000 requests in a row using @felddy's example command and they all came out clean.

I know that with K8s, policies or security groups (if you are on AWS) can sometimes result in inconsistent DNS resolution. I'm running Foundry today on an RPi4 with k3s, locally with Compose, and on a server with KIND and k8s for development and testing.

If you guys have any other set of tests that I could run please let me know.

aetaric commented on June 11, 2024

I am also having this issue on a k8s cluster set up via kubeadm. This is the only container exhibiting the behavior, and it does so on both nightly and release-0.7.9. Not sure if it matters, but my k8s cluster is using CoreDNS and not kube-dns.

hugoprudente commented on June 11, 2024

Hi @aetaric, how is the network on your clusters configured? I have seen problems with k8s and CoreDNS name resolution due to security groups and firewall rules between the nodes.

In all my environments I never had issues, and my K8s development cluster that runs on AWS with EKS also uses CoreDNS and doesn't have the problem.

aetaric commented on June 11, 2024

Well, I am using flannel as the backing network fabric, so no network policy antics should be going on. I am running in vxlan mode for communication between nodes, so that might have something to do with it?

As for physical and logical networking, all k8s nodes are same VLAN, same ToR switch, same subnet.

As I mentioned before, other containers are able to resolve DNS without issue, and improvement-debian does seem to work, if not perfectly, then well enough for the container to pull the app distribution and license info.

aetaric commented on June 11, 2024

So I might have some insight into what the container is doing weird here. I was reviewing my DNS query logs and it seems the container is appending the search domain from DHCP options to the Foundry address:

Got query for foundryvtt.com.k8s.domain.tld|A from 192.168.9.254:8517, relayed to 192.168.8.100:53
Got query for foundryvtt.com.k8s.domain.tld|AAAA from 192.168.9.254:62140, relayed to 192.168.8.100:53

hugoprudente commented on June 11, 2024

I have upgraded my K8s cluster to 1.22, and for the first time I got this error.

Just to get it registered here: the fix for me was to ensure that CoreDNS was sending the resolution to an external resolver, by adding 8.8.8.8 to the ConfigMap:

Data
====
Corefile:
----
.:53 {
    errors
    health {
       lameduck 5s
    }
    ready
    kubernetes cluster.local in-addr.arpa ip6.arpa {
       pods insecure
       fallthrough in-addr.arpa ip6.arpa
       ttl 30
    }
    prometheus :9153
    forward . 8.8.8.8 /etc/resolv.conf {
       max_concurrent 1000
    }
    cache 30
    loop
    reload
    loadbalance
}
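
If it helps anyone apply the same change: with the default ConfigMap name, it can be edited in place, and the reload plugin in the Corefile above will pick up the change:

kubectl -n kube-system edit configmap coredns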

github-actions commented on June 11, 2024

This issue has been automatically marked as stale because it has been inactive for 28 days. To reactivate the issue, simply post a comment with the requested information to help us diagnose this issue. If this issue remains inactive for another 7 days, it will be automatically closed.

github-actions commented on June 11, 2024

This issue has been automatically closed due to inactivity. If you are still experiencing problems, please open a new issue.
