With the bridged_network setup, the arping call that setup_network uses to check veth interface availability fails with a timeout, as shown by the following output of the lithos_tree process:
--- 10.0.0.1 statistics ---
1 packets transmitted, 0 packets received, 100% unanswered (0 extra)
Fatal error: arping failed: exited with code 1
ARPING 10.0.0.1
Timeout
--- 10.0.0.1 statistics ---
1 packets transmitted, 0 packets received, 100% unanswered (0 extra)
Fatal error: arping failed: exited with code 1
ARPING 10.0.0.1
Timeout
TCP dump from host (bridge interface) shows no ARP response being sent:
tcpdump -i br0 -en "icmp or arp"
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on br0, link-type EN10MB (Ethernet), capture size 262144 bytes
15:15:05.564340 4a:41:e6:41:9f:a9 > ff:ff:ff:ff:ff:ff, ethertype ARP (0x0806), length 58: Request who-has 10.0.0.1 (ff:ff:ff:ff:ff:ff) tell 10.0.0.1, length 44
15:15:06.900662 82:84:a5:0c:48:b1 > ff:ff:ff:ff:ff:ff, ethertype ARP (0x0806), length 58: Request who-has 10.0.0.1 (ff:ff:ff:ff:ff:ff) tell 10.0.0.1, length 44
15:15:08.253697 ce:91:ed:2d:a7:19 > ff:ff:ff:ff:ff:ff, ethertype ARP (0x0806), length 58: Request who-has 10.0.0.1 (ff:ff:ff:ff:ff:ff) tell 10.0.0.1, length 44
15:15:09.583953 ba:3f:bc:a6:ac:e8 > ff:ff:ff:ff:ff:ff, ethertype ARP (0x0806), length 58: Request who-has 10.0.0.1 (ff:ff:ff:ff:ff:ff) tell 10.0.0.1, length 44
15:15:10.930642 3a:f1:e9:d9:3f:23 > ff:ff:ff:ff:ff:ff, ethertype ARP (0x0806), length 58: Request who-has 10.0.0.1 (ff:ff:ff:ff:ff:ff) tell 10.0.0.1, length 44
^C
5 packets captured
5 packets received by filter
0 packets dropped by kernel
bridge info on host:
br0 Link encap:Ethernet HWaddr 00:00:00:00:00:00
inet addr:10.0.0.200 Bcast:0.0.0.0 Mask:255.255.255.0
inet6 addr: fe80::1440:b0ff:fec7:5926/64 Scope:Link
UP BROADCAST MULTICAST MTU:1500 Metric:1
RX packets:1047 errors:0 dropped:0 overruns:0 frame:0
TX packets:727 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1000
RX bytes:62207 (60.7 KiB) TX bytes:59939 (58.5 KiB)
veth info in the lithos container:
nsenter -n -p --target 1647 /bin/sh
ifconfig -a
li-ca02f9-0001 Link encap:Ethernet HWaddr 5A:23:45:F5:F9:B4
inet addr:10.0.0.1 Bcast:0.0.0.0 Mask:255.255.255.0
inet6 addr: fe80::5823:45ff:fef5:f9b4/64 Scope:Link
UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
RX packets:12 errors:0 dropped:0 overruns:0 frame:0
TX packets:10 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1000
RX bytes:956 (956.0 B) TX bytes:760 (760.0 B)
lo Link encap:Local Loopback
inet addr:127.0.0.1 Mask:255.0.0.0
inet6 addr: ::1/128 Scope:Host
UP LOOPBACK RUNNING MTU:65536 Metric:1
RX packets:16 errors:0 dropped:0 overruns:0 frame:0
TX packets:16 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1
RX bytes:1376 (1.3 KiB) TX bytes:1376 (1.3 KiB)
I have trouble understanding what this check is trying to accomplish, because:
- the address (here 10.0.0.1) used in the arping call is that of the veth interface located inside the container (in the child namespace)
- arping is invoked from the same namespace in which that interface is located (i.e. the child namespace)
That means the veth device with the 10.0.0.1 address assigned (in the container) effectively sends an ARP broadcast asking "who has 10.0.0.1?" to the bridge on the host, to which the other end of the veth pair is attached, and (as expected) receives no answer. The only interface that could answer is the local one that sent the broadcast. Note that in this respect arping behaves differently from e.g. ping, which does allow pinging a local interface.
I suspect the arping call should be performed in the host (parent) namespace instead of the child namespace, so that the ARP request travels from the bridge to the veth and not the other way around, as it does now. It may also need to select a specific bridge interface: what if different hosts with the 10.0.0.1 address are reachable through several of the host's NICs?
Please see the attached patch for a better understanding of my proposal:
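To illustrate the "specific bridge interface" idea, here is a minimal sketch of how the argument list could be built. The helper `arping_args` and its `iface` parameter are hypothetical and not part of lithos. The flag that selects the interface differs between arping implementations (iputils and BusyBox use `-I`; please check the man page of the one installed), so it is passed in rather than hard-coded. The `-U` and `-c1` arguments are kept as they are in the current code.

```rust
use std::net::IpAddr;

/// Build the arping argument list for probing `ip`.
///
/// `iface` is an optional (flag, interface name) pair, e.g. ("-I", "br0").
/// Both the helper and this parameter are assumptions for illustration only.
fn arping_args(ip: IpAddr, iface: Option<(&str, &str)>) -> Vec<String> {
    let mut args = Vec::new();
    if let Some((flag, name)) = iface {
        // Pin the probe to one interface, so an address reachable through
        // several NICs cannot be answered from the wrong network.
        args.push(flag.to_string());
        args.push(name.to_string());
    }
    // Same arguments as the existing call in setup_network.rs.
    args.push("-U".to_string());
    args.push("-c1".to_string());
    args.push(ip.to_string());
    args
}
```

With `Some(("-I", "br0"))` the resulting list could be passed to the existing `unshare::Command` via its `arg` calls; with `None` it matches what the code runs today.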
diff --git a/src/bin/lithos_knot/setup_network.rs b/src/bin/lithos_knot/setup_network.rs
index f8df4a6..db9227d 100644
--- a/src/bin/lithos_knot/setup_network.rs
+++ b/src/bin/lithos_knot/setup_network.rs
@@ -193,23 +193,27 @@ fn _setup_bridged(sandbox: &SandboxConfig, _child: &ChildInstance, ip: IpAddr)
Ok(s) if s.success() => {}
Ok(s) => bail!("ip route failed: {}", s),
Err(e) => bail!("ip route failed: {}", e),
}
}
+ setns(parent_ns.as_raw_fd(), CloneFlags::CLONE_NEWNET)?;
+
let mut cmd = unshare::Command::new("/usr/bin/arping");
cmd.arg("-U");
cmd.arg("-c1");
cmd.arg(&format!("{}", ip));
debug!("Running {}", cmd.display(&Style::short()));
match cmd.status() {
Ok(s) if s.success() => {}
Ok(s) => bail!("arping failed: {}", s),
Err(e) => bail!("arping failed: {}", e),
}
+ setns(my_ns.as_raw_fd(), CloneFlags::CLONE_NEWNET)?;
+
Ok(())
}
fn _setup_isolated(_sandbox: &SandboxConfig, _child: &ChildInstance)
-> Result<(), Error>
{
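The patch assumes that `parent_ns` and `my_ns` are already-open handles to the two network namespaces; how they are obtained is not shown in the diff. A minimal sketch of one way to get such handles through procfs, assuming `/proc` is mounted and the process has permission to read the target's `ns/net` entry (the helper names `netns_path` and `open_netns` are hypothetical):

```rust
use std::fs::File;
use std::io;
use std::path::PathBuf;

/// Path of the network namespace handle for `pid`,
/// or for the calling process when `pid` is `None`.
fn netns_path(pid: Option<u32>) -> PathBuf {
    match pid {
        Some(pid) => PathBuf::from(format!("/proc/{}/ns/net", pid)),
        None => PathBuf::from("/proc/self/ns/net"),
    }
}

/// Open the namespace handle; the raw fd of the returned `File`
/// is what a `setns(..., CLONE_NEWNET)` call expects.
fn open_netns(pid: Option<u32>) -> io::Result<File> {
    File::open(netns_path(pid))
}
```

Both handles would have to be opened before the first `setns` call, so that the way back into the child namespace is not lost.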
I observed this error on Alpine v3.7 after PR #15 was applied.
Thank you for any ideas or opinions on this.
BTW, many thanks to you and the other contributors for all the awesome containerization stuff (not just lithos). It saves us a lot of time and sanity by not having to deal with Docker :).