
Comments (12)

ibalajiarun commented on May 16, 2024

@squat do you have any ideas? I am trying to connect three clusters with 30 nodes each, and doing a full mesh is taking too much CPU. thanks in advance!


squat commented on May 16, 2024

Hi @ibalajiarun :) it's very exciting that you are applying Kilo to such large clusters. I am running logical meshes on several clusters currently, so I suspect there is something off with your specific deployment. Can you please share the following details?

  • What cloud environment are you running in? AWS, private cloud, DigitalOcean, etc.?
  • What encapsulation option are you using? Always, cross subnet, never?
  • Are you running in compatibility mode with flannel?
  • What are the CIDRs for the internal network and WireGuard subnets?
  • Does the cloud allow IPIP packets? eg this has to be explicitly enabled in AWS security groups and doesn't work at all in DigitalOcean
  • Does the pod-pod network work within locations but not between different networks? Or does it not work anywhere?
  • Any debug logs from Kilo when running in logical locations?

I think this info will help us get to the root of the issue :)


squat commented on May 16, 2024

Hi @ibalajiarun any news?


ibalajiarun commented on May 16, 2024

Hi @squat, sorry for the delay. I was meeting a deadline so I needed to find a quick alternative, which I found with Azure VNet peering because I use Azure exclusively. However, I want to fix this for the future. I will get back to you with details in a day or two. Thanks!


ibalajiarun commented on May 16, 2024

I tried creating a new cluster with three regions and 22 nodes each; it seems to work. Not sure what fixed it because I am using a tagged image and a local kilo manifest to deploy. But let me answer your questions.

  • What cloud environment are you running in? AWS, private cloud, DigitalOcean, etc.?
    I use Azure exclusively

  • What encapsulation option are you using? Always, cross subnet, never?
    I don't understand this part. There is a subnet per region.

  • Are you running in compatibility mode with flannel?
    No flannel compatibility

  • What are the CIDRs for the internal network and WireGuard subnets?
    

WireGuard: 10.42.0.0/24, 10.42.40.0/24, 10.42.53.0/24
Internal CIDRs: 10.0.0.0/24, 10.0.1.0/24, 10.0.2.0/24

  • Does the cloud allow IPIP packets? eg this has to be explicitly enabled in AWS security groups and doesn't work at all in DigitalOcean
    

I have an all-open security group. I think there is no problem with the Azure configuration per se.

  • Does the pod-pod network work within locations but not between different networks? Or does it not work anywhere?
    

When it didn't work, the pod-pod network within locations would work, but not between different networks.

  • Any debug logs from Kilo when running in logical locations?
    

I am not sure if this is helpful, but I noticed the following kind of event constantly emitted in the Kilo pod logs when I had the issue. When everything works, these events are emitted initially and then stop.

{"caller":"mesh.go:402","component":"kilo","event":"update","level":"info","node":{"Endpoint":{"DNS":"","IP":"52.228.14.81","Port":51820},"Key":"dmhZYTFxR3lrWFdDeUhqQ0VDcEFWSStOK085V2VkLzVMbnp5b3krWXcyQT0=","InternalIP":{"IP":"10.0.0.16","Mask":"////AA=="},"LastSeen":1592006766,"Leader":false,"Location":"canadacentral","Name":"destiny-vm16","PersistentKeepalive":0,"Subnet":{"IP":"10.42.38.0","Mask":"////AA=="},"WireGuardIP":{"IP":"10.4.0.9","Mask":"//8AAA=="}},"ts":"2020-06-13T00:06:06.264090041Z"}


squat commented on May 16, 2024

Hi @ibalajiarun, thank you for the detailed write up!
I think the key to this issue is the encapsulation of packets sent in the local subnet. For some context:

  • packets sent between locations are encapsulated and encrypted via WireGuard into UDP packets; these travel just fine through firewalls and this is why using a full mesh with Kilo works in Azure;
  • by default, Kilo uses IPIP encapsulation when transmitting packets within a location; even though you have the firewall set to be completely open, many cloud provider SDNs simply do not support transmitting IPIP packets because it is not a common transport protocol, ie it is not TCP or UDP; Azure is one such cloud provider that does not support forwarding IPIP packets.

I see two possible paths for fixing this:

  1. Kilo allows disabling encapsulation of packets within locations via the --encapsulate=never flag; however, for this to work, you have to disable all source and destination IP checks in Azure so that the SDN will forward packets with IPs that are unknown to it, ie packets from other clusters. This can be done in AWS and other cloud providers, but I'm not sure if this is possible in Azure. It may also require adding some of the routing information to Azure route tables.
  2. The other option is to use flannel for local networking, since flannel uses VXLAN encapsulation, which is supported by Azure. To do this, first enable flannel networking in your k3s configuration and then enable flannel compatibility mode in Kilo.

Option 1 would be ideal because it simplifies the network and removes an extra layer of encapsulation, which saves CPU. However, this doesn't work in all clouds, so option 2 may be necessary.
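
For option 1, a rough sketch of how the flag could be set, assuming Kilo is deployed as a DaemonSet named kilo in the kube-system namespace (adjust the name and namespace to match your manifest):

# Edit the Kilo DaemonSet and add the flag to the kilo container's args
# (the DaemonSet name and namespace here are assumptions).
kubectl -n kube-system edit daemonset kilo
# The container args would then include:
#   - --encapsulate=never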

Please give these a shot and let me know how it goes. Ideally it would be great to use the result of your work to write a quick Azure compatibility doc for the Kilo repo/website.


anjmao commented on May 16, 2024

Hi @squat, I tried option 2, running k3s with VXLAN flannel (kilo-k3s-flannel.yaml). I have 3 nodes: AWS1 (master), AZURE1, AZURE2 (workers). I can reach AWS1<->AZURE1 and AZURE1<->AZURE2, but pinging AWS1<->AZURE2 or AZURE2<->AWS1 does not work.
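
Roughly how I set this up (from memory, so the exact commands may differ slightly):

# Install k3s with the flannel VXLAN backend (vxlan is also the default)
curl -sfL https://get.k3s.io | sh -s - --flannel-backend=vxlan
# Deploy Kilo in flannel compatibility mode using the kilo-k3s-flannel manifest
kubectl apply -f kilo-k3s-flannel.yaml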

Cluster details:

AWS1

interface: kilo0
  public key: xpQbw020BJOb8xzfJw+MEaArOM46UKjJlz6FHUzuEVc=
  private key: (hidden)
  listening port: 51820

peer: Y2SusAEcJwKQsGl6Zg+PLRcRNzRTCcv8f8OR1JK/ogQ=
  endpoint: 52.146.38.226:51820
  allowed ips: 10.42.1.0/24, 10.20.1.5/32, 10.42.2.0/24, 10.20.1.4/32, 10.4.0.2/32
default via 10.10.1.1 dev ens5 proto dhcp src 10.10.1.64 metric 100 
10.4.0.0/16 dev kilo0 proto kernel scope link src 10.4.0.1 
10.10.1.0/24 dev ens5 proto kernel scope link src 10.10.1.64 
10.10.1.1 dev ens5 proto dhcp scope link src 10.10.1.64 metric 100 
10.20.1.4 via 10.4.0.2 dev kilo0 proto static onlink 
10.20.1.5 via 10.4.0.2 dev kilo0 proto static onlink 
10.42.0.0/24 dev cni0 proto kernel scope link src 10.42.0.1 
10.42.1.0/24 via 10.4.0.2 dev kilo0 proto static onlink 
10.42.2.0/24 via 10.4.0.2 dev kilo0 proto static onlink 
172.17.0.0/16 dev docker0 proto kernel scope link src 172.17.0.1 linkdown

AZURE1

interface: kilo0
  public key: Y2SusAEcJwKQsGl6Zg+PLRcRNzRTCcv8f8OR1JK/ogQ=
  private key: (hidden)
  listening port: 51820

peer: xpQbw020BJOb8xzfJw+MEaArOM46UKjJlz6FHUzuEVc=
  endpoint: 67.202.58.81:51820
  allowed ips: 10.42.0.0/24, 10.10.1.64/32, 10.4.0.1/32
  latest handshake: 32 seconds ago
  transfer: 308 B received, 492 B sent
default via 10.20.1.1 dev eth0 proto dhcp src 10.20.1.5 metric 100 
10.4.0.0/16 dev kilo0 proto kernel scope link src 10.4.0.2 
10.10.1.64 via 10.4.0.1 dev kilo0 proto static onlink 
10.20.1.0/24 dev eth0 proto kernel scope link src 10.20.1.5 
10.42.0.0/24 via 10.4.0.1 dev kilo0 proto static onlink 
10.42.1.0/24 dev cni0 proto kernel scope link src 10.42.1.1 
168.63.129.16 via 10.20.1.1 dev eth0 proto dhcp src 10.20.1.5 metric 100 
169.254.169.254 via 10.20.1.1 dev eth0 proto dhcp src 10.20.1.5 metric 100 
172.17.0.0/16 dev docker0 proto kernel scope link src 172.17.0.1 linkdown

AZURE2

interface: kilo0
default via 10.20.1.1 dev eth0 proto dhcp src 10.20.1.4 metric 100 
10.4.0.1 via 10.42.1.0 dev flannel.1 proto static onlink 
10.4.0.2 via 10.42.1.0 dev flannel.1 proto static onlink 
10.10.1.64 via 10.42.1.0 dev flannel.1 proto static onlink 
10.20.1.0/24 dev eth0 proto kernel scope link src 10.20.1.4 
10.42.0.0/24 via 10.42.1.0 dev flannel.1 proto static onlink 
10.42.2.0/24 dev cni0 proto kernel scope link src 10.42.2.1 
168.63.129.16 via 10.20.1.1 dev eth0 proto dhcp src 10.20.1.4 metric 100 
169.254.169.254 via 10.20.1.1 dev eth0 proto dhcp src 10.20.1.4 metric 100 
172.17.0.0/16 dev docker0 proto kernel scope link src 172.17.0.1 linkdown

First issue

I pinged the AWS1 node via its private IP 10.10.1.64 from AZURE2 and checked tcpdump on the AZURE1 node. I saw that AZURE1 encrypts and forwards the traffic to AWS1 and gets a response, which is written to the flannel device at 10.42.2.0, but AZURE2 never receives it.

19:24:42.387057 IP 10.20.1.4.39724 > 10.20.1.5.8472: OTV, flags [I] (0x08), overlay 0, instance 1
IP 10.42.2.0 > 10.10.1.64: ICMP echo request, id 33665, seq 5, length 64
19:24:42.387057 IP 10.42.2.0 > 10.10.1.64: ICMP echo request, id 33665, seq 5, length 64
19:24:42.387156 IP 10.4.0.2 > 10.10.1.64: ICMP echo request, id 41028, seq 5, length 64
19:24:42.387527 IP 10.20.1.5.51820 > 67.202.58.81.51820: UDP, length 128
19:24:42.388838 IP 67.202.58.81.51820 > 10.20.1.5.51820: UDP, length 128
19:24:42.388907 IP 10.10.1.64 > 10.4.0.2: ICMP echo reply, id 41028, seq 5, length 64
19:24:42.388920 IP 10.10.1.64 > 10.42.2.0: ICMP echo reply, id 33665, seq 5, length 64
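
For reference, the capture above came from something along these lines on AZURE1 (the exact filter is approximate):

# Show ICMP, WireGuard (51820/udp), and flannel VXLAN (8472/udp) traffic on the VM NIC
sudo tcpdump -ni eth0 'icmp or udp port 51820 or udp port 8472'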

Second issue

When I pinged the AZURE2 node via its private IP 10.20.1.4 from AWS1, I saw that AZURE1 received the ping request but did not forward it to AZURE2.

19:31:58.891753 IP 67.202.58.81.51820 > 10.20.1.5.51820: UDP, length 128
19:31:58.891854 IP 10.4.0.1 > 10.20.1.4: ICMP echo request, id 6, seq 78, length 64

Just to check, after I added some forwarding rules via iptables I got the ping back, but obviously this hides the real source IP, which is not what we want.

sudo iptables -t nat -A POSTROUTING -o eth0 -j MASQUERADE
sudo iptables -A FORWARD -i kilo0 -o eth0 -j ACCEPT
sudo iptables -A FORWARD -i eth0 -o kilo0 -m state --state RELATED,ESTABLISHED -j ACCEPT

It looks like flannel doesn't know how to forward traffic back to the original source.


LinQing2017 commented on May 16, 2024

I am running Kilo in a three-node k3s cluster. My cluster contains two partitions, but I found that no matter how I modify the configuration, the three nodes are always fully connected.


squat commented on May 16, 2024

Hi @LinQing2017, I suspect you may have a different issue. My guess is that Kilo cannot discover any private IP addresses on the nodes, so they are all automatically being turned into location leaders. This happens because if nodes do not have private IP addresses, then the only way for them to communicate is over a virtual private network, ie over WireGuard. Can you please share the SVG produced by kgctl graph | circo -Tsvg > cluster.svg? And the list of private IP addresses from one of the nodes, ie the output of ip a?
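
Concretely, something like the following, run against the cluster and on one of the nodes, would be enough (circo requires Graphviz to be installed):

# Render the Kilo topology as an SVG
kgctl graph | circo -Tsvg > cluster.svg
# List the addresses on the node's interfaces
ip a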


LinQing2017 commented on May 16, 2024

[SVG graph image]

Above is my SVG image.
Among them, nd-agent-01 and nd-node are in the same private network and do not have public IPs. I want them grouped into one location, with nd-agent-01 manually selected as the leader via the "kilo.squat.ai/leader=true" annotation and with "kilo.squat.ai/persistent-keepalive" configured.
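
For reference, I applied the annotations roughly like this (the keepalive value here is only an example):

kubectl annotate node nd-agent-01 kilo.squat.ai/leader=true
kubectl annotate node nd-agent-01 kilo.squat.ai/persistent-keepalive=10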

Goblin has a stable public IP and sits alone in another location. I use it as the master node of the k3s cluster.

Now I run Kilo in flannel compatibility mode, which achieves what I want.


squat commented on May 16, 2024

Thanks for sharing!
Ok, so it seems like the main problem here is that nd-node and nd-agent-01 are incorrectly getting split up rather than allowed to be in the same location, is that right?
For some reason, it seems that Kilo wants to put each node into its own location, which only happens in two cases:

  1. --mesh-granularity is set to full; or
  2. Kilo does not discover private IP addresses on the nodes, so they get put into their own logical location.

Just to make sure, what is the value of the --mesh-granularity flag on the Kilo daemonset?
Could you please share the output of kubectl get nodes -o yaml?
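
For example, something along these lines would show the flag, assuming Kilo is deployed as a DaemonSet named kilo in kube-system (adjust to your manifest):

# Print the mesh-granularity setting from the Kilo DaemonSet args
kubectl -n kube-system get daemonset kilo -o yaml | grep mesh-granularity
# Dump the node objects so we can inspect the addresses Kilo sees
kubectl get nodes -o yaml > nodes.yaml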


squat commented on May 16, 2024

Hi @LinQing2017, please disregard the last message asking for more info as the behavior you are reporting is due to a bug in Kilo. The IP addresses of the nd-node and nd-agent-01 nodes are incorrectly being identified as public IPs because of a mistake in the IP address identification code. This is fixed by #131.

Please try running the latest Kilo image once this merges! Thanks for reporting this, otherwise we wouldn't have found the bug :)

