Comments (12)
@squat do you have any ideas? I am trying to connect three clusters with 30 nodes each, and doing a full mesh is taking too much CPU. thanks in advance!
from kilo.
Hi @ibalajiarun :) it's very exciting that you're applying Kilo to such large clusters. I am currently running logical meshes on several clusters, so I suspect there is something off in your specific deployment. Can you please share the following details?
- What cloud environment are you running in? AWS, private cloud, DigitalOcean, etc.?
- What encapsulation option are you using? Always, cross subnet, never?
- Are you running in compatibility mode with flannel?
- What are the CIDRs for the internal network and WireGuard subnets?
- Does the cloud allow IPIP packets? E.g. this has to be explicitly enabled in AWS security groups and doesn't work at all in DigitalOcean.
- Does the pod-pod network work within locations but not between different networks? Or does it not work anywhere?
- Any debug logs from Kilo when running in logical locations?
I think this info will help us get to the root of the issue :)
Hi @ibalajiarun any news?
Hi @squat, sorry for the delay. I was meeting a deadline, so I needed to find a quick alternative, which I found with Azure VNet peering because I use Azure exclusively. However, I want to fix this for the future. I will get back to you with details in a day or two. Thanks!
I tried creating a new cluster with three regions and 22 nodes each; it seems to work. I'm not sure what fixed it, because I am using a tagged image and a local Kilo manifest to deploy. But let me answer your questions.
- What cloud environment are you running in? AWS, private cloud, DigitalOcean, etc.?
  I use Azure exclusively.
- What encapsulation option are you using? Always, cross subnet, never?
  I don't understand this part. There is a subnet per region.
- Are you running in compatibility mode with flannel?
  No flannel compatibility.
- What are the CIDRs for the internal network and WireGuard subnets?
  WireGuard: 10.42.0.0/24, 10.42.40.0/24, 10.42.53.0/24
  Internal CIDRs: 10.0.0.0/24, 10.0.1.0/24, 10.0.2.0/24
- Does the cloud allow IPIP packets?
  I have an all-open security group, so I think there is no problem with the Azure configuration per se.
- Does the pod-pod network work within locations but not between different networks? Or does it not work anywhere?
  When it didn't work, the pod-pod network within locations would work, but not between different networks.
- Any debug logs from Kilo when running in logical locations?
  I am not sure if this is helpful, but when I had the issue I noticed the following kind of event constantly emitted in the Kilo pod logs. When everything works, these events are emitted initially and then stop.
{"caller":"mesh.go:402","component":"kilo","event":"update","level":"info","node":{"Endpoint":{"DNS":"","IP":"52.228.14.81","Port":51820},"Key":"dmhZYTFxR3lrWFdDeUhqQ0VDcEFWSStOK085V2VkLzVMbnp5b3krWXcyQT0=","InternalIP":{"IP":"10.0.0.16","Mask":"////AA=="},"LastSeen":1592006766,"Leader":false,"Location":"canadacentral","Name":"destiny-vm16","PersistentKeepalive":0,"Subnet":{"IP":"10.42.38.0","Mask":"////AA=="},"WireGuardIP":{"IP":"10.4.0.9","Mask":"//8AAA=="}},"ts":"2020-06-13T00:06:06.264090041Z"}
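As an aside, the Mask fields in that log entry are base64-encoded raw netmask bytes; they can be decoded with standard tools to sanity-check the subnets:

```shell
# "////AA==" (the InternalIP and Subnet masks) decodes to 255.255.255.0, i.e. a /24:
echo '////AA==' | base64 -d | od -An -tu1
# "//8AAA==" (the WireGuardIP mask) decodes to 255.255.0.0, i.e. a /16:
echo '//8AAA==' | base64 -d | od -An -tu1
```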
Hi @ibalajiarun, thank you for the detailed write-up!
I think the key to this issue is the encapsulation of packets sent in the local subnet. For some context:
- packets sent between locations are encapsulated and encrypted via WireGuard into UDP packets; these travel just fine through firewalls, which is why a full mesh with Kilo works in Azure;
- by default, Kilo uses IPIP encapsulation when transmitting packets within a location; even though your firewall is completely open, many cloud provider SDNs simply do not support transmitting IPIP packets because IPIP is not a common transport protocol, i.e. it is not TCP or UDP; Azure is one such cloud provider that does not forward IPIP packets.
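One way to verify whether a cloud's SDN forwards IPIP (IP protocol 4) is to create a plain ipip tunnel between two VMs in the same subnet and ping across it. This is a hypothetical check, not a Kilo command; the interface name, the 10.99.0.0/30 tunnel addresses, and the node IPs are placeholders:

```shell
# On node A (private IP 10.0.0.4):
sudo ip tunnel add ipiptest mode ipip local 10.0.0.4 remote 10.0.0.5
sudo ip addr add 10.99.0.1/30 dev ipiptest
sudo ip link set ipiptest up

# On node B (private IP 10.0.0.5), mirror the configuration:
sudo ip tunnel add ipiptest mode ipip local 10.0.0.5 remote 10.0.0.4
sudo ip addr add 10.99.0.2/30 dev ipiptest
sudo ip link set ipiptest up

# From node A: if the SDN drops IPIP packets, this times out.
ping -c 3 10.99.0.2
```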
I see two possible paths for fixing this:
- Kilo allows disabling encapsulation of packets within locations via the --encapsulate=never flag. However, for this to work, you have to disable all source and destination IP checks in Azure so that the SDN will forward packets with IPs that are unknown to it, i.e. packets from other clusters. This can be done in AWS and other cloud providers, but I'm not sure if this is possible in Azure. It may require internalizing some of the routing data in Azure route tables.
- The other option is to use flannel for local networking, since flannel uses VXLAN encapsulation, which is supported by Azure. To do this, first enable flannel networking in your k3s configuration and then enable flannel compatibility mode in Kilo.
Option 1 would be ideal because it simplifies the network and removes an extra layer of encapsulation, which saves CPU. However, this doesn't work in all clouds, so option 2 may be necessary.
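A quick sketch of applying option 1, assuming the default manifest where Kilo runs as a DaemonSet named kilo in the kube-system namespace with an existing args list; adjust the names to match your deployment:

```shell
# Append --encapsulate=never to the Kilo container's arguments.
kubectl -n kube-system patch daemonset kilo --type=json \
  -p '[{"op":"add","path":"/spec/template/spec/containers/0/args/-","value":"--encapsulate=never"}]'
```

Remember that this only helps if the Azure SDN is also configured to forward packets with source/destination IPs it does not know about.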
Please give these a shot and let me know how it goes. Ideally, it would be great to use the results of your work to write a quick Azure compatibility doc for the Kilo repo/website.
Hi @squat, I tried option 2, running k3s with VXLAN flannel (kilo-k3s-flannel.yaml). I have 3 nodes: AWS1 (master), AZURE1, and AZURE2 (workers). AWS1<->AZURE1 and AZURE1<->AZURE2 can reach each other, but pinging between AWS1 and AZURE2 in either direction does not work.
AWS1
interface: kilo0
public key: xpQbw020BJOb8xzfJw+MEaArOM46UKjJlz6FHUzuEVc=
private key: (hidden)
listening port: 51820
peer: Y2SusAEcJwKQsGl6Zg+PLRcRNzRTCcv8f8OR1JK/ogQ=
endpoint: 52.146.38.226:51820
allowed ips: 10.42.1.0/24, 10.20.1.5/32, 10.42.2.0/24, 10.20.1.4/32, 10.4.0.2/32
default via 10.10.1.1 dev ens5 proto dhcp src 10.10.1.64 metric 100
10.4.0.0/16 dev kilo0 proto kernel scope link src 10.4.0.1
10.10.1.0/24 dev ens5 proto kernel scope link src 10.10.1.64
10.10.1.1 dev ens5 proto dhcp scope link src 10.10.1.64 metric 100
10.20.1.4 via 10.4.0.2 dev kilo0 proto static onlink
10.20.1.5 via 10.4.0.2 dev kilo0 proto static onlink
10.42.0.0/24 dev cni0 proto kernel scope link src 10.42.0.1
10.42.1.0/24 via 10.4.0.2 dev kilo0 proto static onlink
10.42.2.0/24 via 10.4.0.2 dev kilo0 proto static onlink
172.17.0.0/16 dev docker0 proto kernel scope link src 172.17.0.1 linkdown
AZURE1
interface: kilo0
public key: Y2SusAEcJwKQsGl6Zg+PLRcRNzRTCcv8f8OR1JK/ogQ=
private key: (hidden)
listening port: 51820
peer: xpQbw020BJOb8xzfJw+MEaArOM46UKjJlz6FHUzuEVc=
endpoint: 67.202.58.81:51820
allowed ips: 10.42.0.0/24, 10.10.1.64/32, 10.4.0.1/32
latest handshake: 32 seconds ago
transfer: 308 B received, 492 B sent
default via 10.20.1.1 dev eth0 proto dhcp src 10.20.1.5 metric 100
10.4.0.0/16 dev kilo0 proto kernel scope link src 10.4.0.2
10.10.1.64 via 10.4.0.1 dev kilo0 proto static onlink
10.20.1.0/24 dev eth0 proto kernel scope link src 10.20.1.5
10.42.0.0/24 via 10.4.0.1 dev kilo0 proto static onlink
10.42.1.0/24 dev cni0 proto kernel scope link src 10.42.1.1
168.63.129.16 via 10.20.1.1 dev eth0 proto dhcp src 10.20.1.5 metric 100
169.254.169.254 via 10.20.1.1 dev eth0 proto dhcp src 10.20.1.5 metric 100
172.17.0.0/16 dev docker0 proto kernel scope link src 172.17.0.1 linkdown
AZURE2
interface: kilo0
default via 10.20.1.1 dev eth0 proto dhcp src 10.20.1.4 metric 100
10.4.0.1 via 10.42.1.0 dev flannel.1 proto static onlink
10.4.0.2 via 10.42.1.0 dev flannel.1 proto static onlink
10.10.1.64 via 10.42.1.0 dev flannel.1 proto static onlink
10.20.1.0/24 dev eth0 proto kernel scope link src 10.20.1.4
10.42.0.0/24 via 10.42.1.0 dev flannel.1 proto static onlink
10.42.2.0/24 dev cni0 proto kernel scope link src 10.42.2.1
168.63.129.16 via 10.20.1.1 dev eth0 proto dhcp src 10.20.1.4 metric 100
169.254.169.254 via 10.20.1.1 dev eth0 proto dhcp src 10.20.1.4 metric 100
172.17.0.0/16 dev docker0 proto kernel scope link src 172.17.0.1 linkdown
First issue
I pinged the AWS1 node via its private IP 10.10.1.64 from AZURE2 and checked tcpdump on the AZURE1 node. I saw that AZURE1 encrypts and forwards the traffic to AWS1 and gets a response, which is written to the flannel device at 10.42.2.0, but AZURE2 never receives it.
19:24:42.387057 IP 10.20.1.4.39724 > 10.20.1.5.8472: OTV, flags [I] (0x08), overlay 0, instance 1
IP 10.42.2.0 > 10.10.1.64: ICMP echo request, id 33665, seq 5, length 64
19:24:42.387057 IP 10.42.2.0 > 10.10.1.64: ICMP echo request, id 33665, seq 5, length 64
19:24:42.387156 IP 10.4.0.2 > 10.10.1.64: ICMP echo request, id 41028, seq 5, length 64
19:24:42.387527 IP 10.20.1.5.51820 > 67.202.58.81.51820: UDP, length 128
19:24:42.388838 IP 67.202.58.81.51820 > 10.20.1.5.51820: UDP, length 128
19:24:42.388907 IP 10.10.1.64 > 10.4.0.2: ICMP echo reply, id 41028, seq 5, length 64
19:24:42.388920 IP 10.10.1.64 > 10.42.2.0: ICMP echo reply, id 33665, seq 5, length 64
Second issue
When I pinged the AZURE2 node via its private IP 10.20.1.4 from AWS1, I saw that AZURE1 received the ping request but did not forward it to AZURE2.
19:31:58.891753 IP 67.202.58.81.51820 > 10.20.1.5.51820: UDP, length 128
19:31:58.891854 IP 10.4.0.1 > 10.20.1.4: ICMP echo request, id 6, seq 78, length 64
Just to check: after I added some forwarding rules via iptables, I got the ping back, but obviously this hides the real source IP, which is not what we want.
# Masquerade traffic leaving eth0 so replies come back to this node
sudo iptables -t nat -A POSTROUTING -o eth0 -j MASQUERADE
# Allow forwarding from the WireGuard interface out through eth0
sudo iptables -A FORWARD -i kilo0 -o eth0 -j ACCEPT
# Allow return traffic for established connections back into kilo0
sudo iptables -A FORWARD -i eth0 -o kilo0 -m state --state RELATED,ESTABLISHED -j ACCEPT
It looks like flannel doesn't know how to forward the traffic back to the original source.
I am running Kilo in a three-node k3s cluster. My cluster contains two partitions, but I found that no matter how I modify the configuration, the three nodes are always fully connected.
Hi @LinQing2017, I suspect you may have a different issue. My guess is that perhaps Kilo cannot discover any private IP addresses on the nodes, so they are all automatically being turned into location leaders. This happens because if nodes do not have private IP addresses, then the only way for them to communicate is over a virtual private network, i.e. over WireGuard. Can you please share the SVG produced by kgctl graph | circo -Tsvg > cluster.svg? And the list of IP addresses on one of the nodes, i.e. the output of ip a?
Above is my SVG image.
Among them, nd-agent-01 and nd-node are in the same private network and do not have public IPs. I want them grouped into one location, with nd-agent-01 manually selected as the leader via the kilo.squat.ai/leader=true annotation and kilo.squat.ai/persistent-keepalive configured.
Goblin has a stable public IP and sits alone in another location. I use it as the master node of the k3s cluster.
For now I run Kilo in flannel compatibility mode, which achieves my goal.
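For reference, the annotations described above can be applied with kubectl; the keepalive interval here is only an example value, not one taken from this cluster:

```shell
# Mark nd-agent-01 as the location leader.
kubectl annotate node nd-agent-01 kilo.squat.ai/leader=true
# Configure a persistent keepalive interval (seconds) for the node.
kubectl annotate node nd-agent-01 kilo.squat.ai/persistent-keepalive=10
```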
Thanks for sharing!
Ok, so it seems like the main problem here is that nd-node and nd-agent-01 are incorrectly getting split up rather than being allowed to be in the same location, is that right?
For some reason, it seems that Kilo wants to put each node into its own location, which only happens in two cases:
- --mesh-granularity is set to full; or
- Kilo does not discover private IP addresses on the nodes, so they each get put into their own logical location.
Just to make sure, what is the value of the --mesh-granularity flag on the Kilo daemonset? Could you please share the output of kubectl get nodes -o yaml?
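Both checks can be run from the command line. This assumes the default manifest, where the Kilo DaemonSet is named kilo in the kube-system namespace; adjust the names as needed:

```shell
# Inspect the Kilo container's arguments, including any --mesh-granularity flag:
kubectl -n kube-system get daemonset kilo -o jsonpath='{.spec.template.spec.containers[0].args}'
# Dump the node objects to see the addresses Kilo would discover:
kubectl get nodes -o yaml
```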
Hi @LinQing2017, please disregard the last message asking for more info, as the behavior you are reporting is due to a bug in Kilo. The IP addresses of the nd-node and nd-agent-01 nodes are incorrectly being identified as public IPs because of a mistake in the IP address identification code. This is fixed by #131.
Please try running the latest Kilo image once this merges! Thanks for reporting this, otherwise we wouldn't have found the bug :)