
libcalico's Introduction


libcalico

NOTE: Python libcalico is no longer being actively developed, and as such is likely to become out-of-date and potentially incompatible with newer Calico versions and features. Instead, it is strongly recommended to use the Golang library, libcalico-go. Feel free to contribute patches to this repo as the maintainers will continue to review and merge community PRs.

Libcalico is a library for interacting with the Calico data model. It also contains code for working with veths.

  • It's written in Python (though ports into other languages would be welcomed as PRs)
  • It currently just talks to etcd as the backend datastore.

It's currently focused on the container side of Calico, though again PRs are welcomed to make it more general.

Running tests

To run tests for libcalico:

  1. Install Docker.

  2. At the root of the libcalico directory, run:

     make test
    


libcalico's People

Contributors

adidenko, alexaltair, alexwlchan, artem-panchenko, caoshufeng, caseydavenport, david-dever-23-box, djlwilder, djosborne, fasaxc, frnkdny, insequent, keshto, lukasa, luke-mino-altherr, lwr20, matthewdupre, mgleung, mikev, ozdanborne, paultiplady, robbrockbank, symmetric, tomdee, tonicmuroq, trimbiggs


libcalico's Issues

Use $HOSTNAME environment variable instead of `socket.gethostname()`

Lots of libcalico API calls use socket.gethostname() to get the name of the current host. This is fine in most circumstances, but limits flexibility in some environments.

In particular, we actually only care that the hostname is a unique, stable identifier for the compute host. Some orchestration systems, like Mesos, use reverse-name-resolution to name the compute host. This can leave Calico inconsistent with the orchestrator (e.g. Mesos has FQDN and Calico just has a local name).

After discussing with @tomdee, we agreed a good solution would be to use $HOSTNAME instead of socket.gethostname(). It's a standard environment variable that is ordinarily set to the same value as would be returned by socket.gethostname(). But, it gives an admin more flexibility, since they can override an environment variable without affecting other systems and processes.

IPs allocated from a stale block when a pool is deleted

Scenario:

  1. A node is configured with the default Calico pool.
  2. A container is created on the node, causing a block to be allocated to the host.
  3. The default pool is deleted, and a new pool is added.
  4. Another container is created on the node.

Result: the container is allocated an IP from the initially allocated block, instead of the pool that is currently configured.

Expected: the container gets an IP from the new pool. It's fine if the existing containers keep the IPs from the stale block.

IPAM failures when running kubernetes scale test

I'm not sure if this is a libcalico issue or a kubernetes plugin issue.

When spinning up pods on 25 hosts, I hit this issue on two of the hosts. It caused one pod to fail on the problem hosts. The other hosts had no problems.

2015-12-18 19:50:33,201 1335 INFO No initialization work to perform
2015-12-18 20:04:29,722 2284 [dcdc7ab5159b] INFO Executing Calico pod-creation hook
2015-12-18 20:04:29,729 2284 [dcdc7ab5159b] INFO Configuring pod default/pinger-wgrjp (container_id dcdc7ab5159b3c4fd4b5eac21934892cbfed57d784dad770a483f82386cbd112)
2015-12-18 20:04:29,810 2284 [dcdc7ab5159b] INFO Configuring Calico network interface
2015-12-18 20:04:29,826 2284 [dcdc7ab5159b] INFO Using Calico IPAM
2015-12-18 20:04:29,827 2284 [dcdc7ab5159b] INFO pycalico.ipam: Auto-assign 1 IPv4, 0 IPv6 addrs
2015-12-18 20:04:29,848 2284 [dcdc7ab5159b] INFO pycalico.ipam: Ran out of affine blocks for calico-19 in pool None
2015-12-18 20:04:30,617 2322 [1a4a23d29095] INFO Executing Calico pod-creation hook
2015-12-18 20:04:30,619 2322 [1a4a23d29095] INFO Configuring pod default/pinger-v9hq9 (container_id 1a4a23d29095b633f5d89bd883062f29d53f8185e71e8e02bd1cdc722b39163f)
2015-12-18 20:04:30,750 2322 [1a4a23d29095] INFO Configuring Calico network interface
2015-12-18 20:04:30,813 2322 [1a4a23d29095] INFO Using Calico IPAM
2015-12-18 20:04:30,814 2322 [1a4a23d29095] INFO pycalico.ipam: Auto-assign 1 IPv4, 0 IPv6 addrs
2015-12-18 20:04:30,966 2322 [1a4a23d29095] INFO pycalico.ipam: Ran out of affine blocks for calico-19 in pool None
2015-12-18 20:04:33,081 2322 [1a4a23d29095] ERROR Error networking pod - cleaning up
Traceback (most recent call last):
  File "<string>", line 120, in create
  File "<string>", line 334, in _configure_interface
  File "<string>", line 374, in _create_endpoint
  File "<string>", line 428, in _assign_container_ip
  File "/code/build/calico/out00-PYZ.pyz/pycalico.ipam", line 413, in auto_assign_ips
  File "/code/build/calico/out00-PYZ.pyz/pycalico.ipam", line 506, in _auto_assign
  File "/code/build/calico/out00-PYZ.pyz/pycalico.ipam", line 153, in _new_affine_block
  File "/code/build/calico/out00-PYZ.pyz/pycalico.ipam", line 188, in _claim_block_affinity
  File "/code/build/calico/out00-PYZ.pyz/etcd.client", line 584, in delete
  File "/code/build/calico/out00-PYZ.pyz/etcd.client", line 848, in wrapper
  File "/code/build/calico/out00-PYZ.pyz/etcd.client", line 928, in _handle_server_response
  File "/code/build/calico/out00-PYZ.pyz/etcd", line 304, in handle
EtcdKeyNotFound: Key not found : /calico/ipam/v2/host/calico-19/ipv4/block/192.168.4.128-26
2015-12-18 20:04:33,083 2322 [1a4a23d29095] INFO Removing networking from pod default/pinger-v9hq9 (container id 1a4a23d29095b633f5d89bd883062f29d53f8185e71e8e02bd1cdc722b39163f)
2015-12-18 20:04:33,120 2322 [1a4a23d29095] ERROR Error cleaning up pod
Traceback (most recent call last):
  File "<string>", line 129, in create
  File "<string>", line 155, in delete
SystemExit: 0
2015-12-18 20:04:33,120 2322 [1a4a23d29095] INFO Done cleaning up
2015-12-18 20:04:34,014 2284 [dcdc7ab5159b] INFO pycalico.ipam: Auto-assigned IPv4s ['192.168.5.0']
2015-12-18 20:04:34,054 2284 [dcdc7ab5159b] INFO pycalico.ipam: Auto-assigned IPv6s []
2015-12-18 20:04:34,055 2284 [dcdc7ab5159b] INFO Creating Calico endpoint with IPs [IPAddress('192.168.5.0')]
2015-12-18 20:04:34,442 2284 [dcdc7ab5159b] INFO Finished configuring network interface
2015-12-18 20:04:34,443 2284 [dcdc7ab5159b] INFO Created Calico endpoint: 8eac3af4a5c211e58b7f080027684567
2015-12-18 20:04:34,632 2284 [dcdc7ab5159b] INFO Setting profile 'default-profile' on endpoint 8eac3af4a5c211e58b7f080027684567
2015-12-18 20:04:34,889 2284 [dcdc7ab5159b] INFO Successfully configured networking for pod default/pinger-wgrjp
2015-12-18 20:04:34,899 2284 [dcdc7ab5159b] WARNING TIMING,setup,default,pinger-wgrjp,dcdc7ab5159b3c4fd4b5eac21934892cbfed57d784dad770a483f82386cbd112,5.17914915085
2015-12-18 20:04:36,948 2416 [dcdc7ab5159b] WARNING TIMING,status,pinger-wgrjp,default,dcdc7ab5159b3c4fd4b5eac21934892cbfed57d784dad770a483f82386cbd112,0.0975790023804
2015-12-18 20:04:38,013 2456 [f2ed777024f6] INFO Executing Calico pod-creation hook
2015-12-18 20:04:38,015 2456 [f2ed777024f6] INFO Configuring pod default/pinger-jmt3z (container_id f2ed777024f6f17635456c81c169790818061ff358a0d473fb9454ff818a4d53)
2015-12-18 20:04:38,121 2456 [f2ed777024f6] INFO Configuring Calico network interface
2015-12-18 20:04:38,247 2456 [f2ed777024f6] INFO Using Calico IPAM
2015-12-18 20:04:38,248 2456 [f2ed777024f6] INFO pycalico.ipam: Auto-assign 1 IPv4, 0 IPv6 addrs
2015-12-18 20:04:39,024 2456 [f2ed777024f6] INFO pycalico.ipam: Auto-assigned IPv4s ['192.168.5.1']
2015-12-18 20:04:39,177 2456 [f2ed777024f6] INFO pycalico.ipam: Auto-assigned IPv6s []
2015-12-18 20:04:39,177 2456 [f2ed777024f6] INFO Creating Calico endpoint with IPs [IPAddress('192.168.5.1')]
2015-12-18 20:04:39,872 2456 [f2ed777024f6] INFO Finished configuring network interface
2015-12-18 20:04:39,873 2456 [f2ed777024f6] INFO Created Calico endpoint: 91b9e336a5c211e59066080027684567

IP allocation of first address can be quite slow on a large system

When you grab the first address that has ever been used on a host, the code scans through all of the known blocks, checking whether each is in use and attempting to grab it. Unfortunately, that can be quite slow on a large system. The example in front of me checked 1284 blocks, taking 74 seconds, before it found an unused one (though this is not a particularly fast server, and there was contention both for etcd and for CPU on the host itself, so I can believe this is uncharacteristic).

Two possible solutions came to mind.

  • Do not scan through them in a fixed order, so that you are not guaranteed to hit all the used blocks before the unused ones. This is cheap and easy, but would actually break some of my test code, and doesn't feel like a long term solution.
  • Have some kind of unused blocks list (that cannot be trusted and must be checked) to give you a hint which blocks are in use.

This issue has no significant impact on me in practice; I can grab enough blocks at start of day to avoid it. I reckon this needs to be fixed, but not at high priority.

IPAM raising KeyError

It's not impossible that this is a problem with the calling code, but here is a stack trace that, to me, implies a possible bug in the libcalico IPAM.

My suspicion is that the issue was caused by multiple hosts attempting to grab an IP at the same time, and that one of them found the block it was about to grab disappear under its feet (this is a moderately sized scale test with 200 hosts, one of which failed). Is that plausible? If so, I'd like to aim for a workaround (assuming the fix isn't too hard). Would catching the KeyError and trying again be sensible (see the sketch after the calling code below)?

2015-11-02 15:07:47,642:INFO:auto_assign_ips(407): Auto-assign 1 IPv4, 0 IPv6 addrs
2015-11-02 15:07:47,673:INFO:_auto_assign(457): Ran out of affine blocks for plw-host-0138 in pool None
2015-11-02 15:07:48,427:ERROR:<module>(405): Main exiting
Traceback (most recent call last):
  File "./host-agent.py", line 403, in <module>
    main()
  File "./host-agent.py", line 386, in main
    start_job(job_id, sub_job, etcd_client)
  File "./host-agent.py", line 130, in start_job
    attributes={})
  File "/usr/local/lib/python2.7/site-packages/pycalico/ipam.py", line 409, in auto_assign_ips
    attributes, pool[0], hostname)
  File "/usr/local/lib/python2.7/site-packages/pycalico/ipam.py", line 475, in _auto_assign
    pool)
  File "/usr/local/lib/python2.7/site-packages/pycalico/ipam.py", line 151, in _new_affine_block
    self._claim_block_affinity(host, block_cidr)
  File "/usr/local/lib/python2.7/site-packages/pycalico/ipam.py", line 177, in _claim_block_affinity
    block = self._read_block(block_cidr)
  File "/usr/local/lib/python2.7/site-packages/pycalico/ipam.py", line 65, in _read_block
    raise KeyError(str(block_cidr))
KeyError: '192.168.56.0/26'

The calling code in host-agent.py looks like this.

            ip_list, _ = ipam_client.auto_assign_ips(num_v4=1,
                                                     num_v6=0,
                                                     handle_id=None,
                                                     attributes={})

Exception in netns.py

@lwr20 was rerunning our scale testing script today; the build pulled in a new revision of libcalico and we started getting this error. He's going to retry with an older revision to confirm whether it's a regression.

Traceback (most recent call last):
  File "./host-agent.py", line 354, in <module>
    main()
  File "./host-agent.py", line 337, in main
    start_job(job_id, sub_job)
  File "./host-agent.py", line 199, in start_job
    netns.move_veth_into_ns(pid, ep.temp_interface_name, interface)
  File "/usr/local/lib/python2.7/site-packages/pycalico/netns.py", line 118, in move_veth_into_ns
    with NamedNamespace(namespace) as ns:
  File "/usr/local/lib/python2.7/site-packages/pycalico/netns.py", line 227, in __init__
    self.ns_path = namespace.path
AttributeError: 'int' object has no attribute 'path'

libcalico yields misleading error from bad ETCD_AUTHORITY env var

If the ETCD_AUTHORITY environment variable is missing a colon, pycalico will fail to initialize the IPAMClient with a misleading stack trace.

$ ETCD_AUTHORITY=erroneous python -c "import pycalico.ipam; pycalico.ipam.IPAMClient()"
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/usr/local/lib/python2.7/dist-packages/pycalico/datastore.py", line 137, in __init__
    (host, port) = etcd_authority.split(":", 1)
ValueError: need more than 1 value to unpack


I propose that we catch the ValueError and reword it to state "malformed ETCD_AUTHORITY", possibly in the form of a new MalformedEtcdAuthority exception?

add support for the workloadendpoint.active_instance_id field

The workloadendpoint.active_instance_id field was introduced in libcalico-go (projectcalico/libcalico-go#396) to fix https://github.com/projectcalico/cni-plugin/issues/310. Currently libcalico-go has the field but libcalico doesn't.

We should add it to libcalico as well.

Expected Behavior

Run calico-policy-controller and the cni-plugin e.g. on a three-node Kubernetes cluster.
Delete a pod and immediately recreate it with the same name, scheduled on the same host (which is what StatefulSets do when their pods vanish; scheduling to the same host can happen).
The newly created pod should have network connectivity.

Current Behavior

The newly created pod loses network connectivity after ~30 seconds.

This is due to these inner workings:
When creating the new pod, Kubernetes calls the cni-plugin with CNI_COMMAND=ADD and CNI_CONTAINERID set to the new pod's ID. The cni-plugin writes a workloadendpoint into etcd with the active_instance_id field set to the new pod's ID.
The calico-policy-controller updates the workloadendpoint (after ~2 seconds), effectively removing the active_instance_id field.
After ~30 seconds, the cni-plugin gets a CNI_COMMAND=DEL call from Kubernetes with CNI_CONTAINERID set to the old pod's ID. As the active_instance_id field is gone from etcd, the cni-plugin deletes the new pod's network connectivity.

To summarize: As calico-policy-controller removes the active_instance_id field from workloadendpoints when updating them, the effects described in https://github.com/projectcalico/cni-plugin/issues/310 reappear.

Possible Solution

Adding the workloadendpoint.active_instance_id field to libcalico fixes this behavior.

hostname must not contain `_`?

From here I found that when validating a hostname, no '_' is allowed. But if I use a CI tool like gitlab-ci, running in Docker containers, the hostname of a linked container will be 'hub.docker.com__service__etcd' if the image is 'hub.docker.com/service/etcd'. Could this be changed?

How do I install pycalico?

I just hit a problem in the main calico-docker repo which ended in an ImportError with pycalico:

alexwlchan at ubuntuvm in /m/d/calico-docker on git:master
$ dist/calicoctl checksystem
Traceback (most recent call last):
  File "", line 43, in 
ImportError: No module named pycalico.datastore_errors

What if I try installing with pip? Nope:

alexwlchan at ubuntuvm in /m/d/calico-docker on git:master
$ sudo pip install pycalico
Collecting pycalico
  Could not find a version that satisfies the requirement pycalico (from versions: )
No matching distribution found for pycalico

Nothing on PyPI; nothing at https://github.com/projectcalico/pycalico. I eventually find this repo, check it out, and run pip install -e . from the root. But I still get the same error.

I assume I must have missed a step somewhere, but what?

IPs allocated from the original pool after deletion

This seems to be a recurrence of #56. I used https://github.com/projectcalico/calico/blob/master/v2.0/getting-started/kubernetes/installation/hosted/calico.yaml to deploy into a kops cluster.

System pods get 192.168.0.0/16 IPs:

admin@ip-172-20-83-74:~$ kubectl get po -n kube-system -o wide
NAME                                                                 READY     STATUS    RESTARTS   AGE       IP                NODE
...
dns-controller-2522975163-kbdx6                                      1/1       Running   0          2m        192.168.138.128   ip-172-20-83-74.us-west-1.compute.internal
kube-dns-v20-3531996453-16zaj                                        2/3       Running   0          1h        192.168.140.0     ip-172-20-89-244.us-west-1.compute.internal
kube-dns-v20-3531996453-l9px7                                        3/3       Running   0          1h        192.168.149.128   ip-172-20-117-143.us-west-1.compute.internal
...

Default pool is deleted and new pool is created:

admin@ip-172-20-83-74:~$ calicoctl delete IpPool 192.168.0.0/16
Successfully deleted 1 'ipPool' resource(s)
admin@ip-172-20-83-74:~$ cat << EOF | calicoctl create -f -
 - apiVersion: v1
>   kind: ipPool
>   metadata:
>     cidr: 100.64.0.0/10
>   spec:
>     ipip:
>       enabled: true
>     nat-outgoing: true
> EOF
Successfully created 1 'ipPool' resource(s)

System pods deleted:

admin@ip-172-20-83-74:~$ kubectl delete po dns-controller-2522975163-kbdx6 kube-dns-v20-3531996453-16zaj kube-dns-v20-3531996453-l9px7 -n kube-system
pod "dns-controller-2522975163-kbdx6" deleted
pod "kube-dns-v20-3531996453-16zaj" deleted
pod "kube-dns-v20-3531996453-l9px7" deleted

New system pods still have IPs from 192.168.0.0/16:

admin@ip-172-20-83-74:~$ kubectl get po -n kube-system -o wide
NAME                                                                 READY     STATUS        RESTARTS   AGE       IP                NODE
...
dns-controller-2522975163-dczfr                                      1/1       Running       0          3s        192.168.138.129   ip-172-20-83-74.us-west-1.compute.internal
kube-dns-v20-3531996453-16zaj                                        3/3       Terminating   0          1h        192.168.140.0     ip-172-20-89-244.us-west-1.compute.internal
kube-dns-v20-3531996453-c7z23                                        2/3       Running       0          3s        192.168.140.1     ip-172-20-89-244.us-west-1.compute.internal
kube-dns-v20-3531996453-l9px7                                        3/3       Terminating   0          1h        192.168.149.128   ip-172-20-117-143.us-west-1.compute.internal
kube-dns-v20-3531996453-ujdum                                        2/3       Running       0          3s        192.168.149.129   ip-172-20-117-143.us-west-1.compute.internal
...

Two hosts managed to get the same block

The code I was running does not include the fix for issue #47; if you think that's the cause, feel free to comment and close without wasting any time on it.

I have 500 hosts trying to grab addresses at once. See the following.

core@plw-etcd-00 ~ $ etcdctl ls /calico/ipam/v2 --recursive | grep 192.168.242.192-26
/calico/ipam/v2/host/host-0282/ipv4/block/192.168.242.192-26
/calico/ipam/v2/host/host-0268/ipv4/block/192.168.242.192-26
/calico/ipam/v2/assignment/ipv4/block/192.168.242.192-26
core@plw-etcd-00 ~ $ etcdctl get /calico/ipam/v2/assignment/ipv4/block/192.168.242.192-26
{"attributes": [{"handle_id": null, "secondary": {}}], "cidr": "192.168.242.192/26", "affinity": "host:host-0282", "allocations": [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]}

I have some trace of host-0268 that is probably relevant.

2015-12-02 09:55:17,151:INFO:[140336892864256]:start_job(126): Starting job 1-test-containers-host-0268 on host-0268, {u'environment': {u'PING_TARGET': u'172.31.0.111'}, u'manual_network': {u'profile_ids': [u'prof-110']}, u'container': u'10.240.0.6:5000/ping-agent', u'name': u'host-0268_001-to-172.31.0.111-host-0170', u'exec': [u'python', u'/usr/src/app/ping-agent.py']}
2015-12-02 09:55:17,151:DEBUG:[140336892864256]:read(466): Issuing read for key /calico/bgp/v1/host/plw-host-0268/ip_addr_v4 with args {}
2015-12-02 09:55:17,162:DEBUG:[140336892864256]:_make_request(387): "GET /v2/keys/calico/bgp/v1/host/plw-host-0268/ip_addr_v4 HTTP/1.1" 200 142
2015-12-02 09:55:17,163:DEBUG:[140336892864256]:read(466): Issuing read for key /calico/bgp/v1/host/plw-host-0268/ip_addr_v6 with args {}
2015-12-02 09:55:17,165:DEBUG:[140336892864256]:_make_request(387): "GET /v2/keys/calico/bgp/v1/host/plw-host-0268/ip_addr_v6 HTTP/1.1" 200 131
2015-12-02 09:55:17,166:DEBUG:[140336892864256]:read(466): Issuing read for key /calico/v1/ipam/v4/pool/ with args {'recursive': True}
2015-12-02 09:55:17,172:DEBUG:[140336892864256]:_make_request(387): "GET /v2/keys/calico/v1/ipam/v4/pool/?recursive=true HTTP/1.1" 200 466
2015-12-02 09:55:17,173:INFO:[140336892864256]:auto_assign_ips(410): Auto-assign 1 IPv4, 0 IPv6 addrs
2015-12-02 09:55:17,173:DEBUG:[140336892864256]:read(466): Issuing read for key /calico/ipam/v2/host/host-0268/ipv4/block/ with args {}
2015-12-02 09:55:17,175:DEBUG:[140336892864256]:_make_request(387): "GET /v2/keys/calico/ipam/v2/host/host-0268/ipv4/block/ HTTP/1.1" 200 267
2015-12-02 09:55:17,175:DEBUG:[140336892864256]:_auto_assign_block(534): Auto-assigning from block 192.168.242.192/26
2015-12-02 09:55:17,176:DEBUG:[140336892864256]:_auto_assign_block(536): Auto-assign from 192.168.242.192/26, retry 0
2015-12-02 09:55:17,176:DEBUG:[140336892864256]:read(466): Issuing read for key /calico/ipam/v2/assignment/ipv4/block/192.168.242.192-26 with args {}
2015-12-02 09:55:17,177:DEBUG:[140336892864256]:_make_request(387): "GET /v2/keys/calico/ipam/v2/assignment/ipv4/block/192.168.242.192-26 HTTP/1.1" 200 638
2015-12-02 09:55:17,178:ERROR:[140336892864256]:process_jobs(615): Failed to create container
Traceback (most recent call last):
  File "./host-agent.py", line 612, in process_jobs
    etcd_client, ipam_client, docker_client)
  File "./host-agent.py", line 165, in start_job
    attributes={})
  File "/usr/local/lib/python2.7/site-packages/pycalico/ipam.py", line 412, in auto_assign_ips
    attributes, pool[0])
  File "/usr/local/lib/python2.7/site-packages/pycalico/ipam.py", line 465, in _auto_assign
    attributes)
  File "/usr/local/lib/python2.7/site-packages/pycalico/ipam.py", line 542, in _auto_assign_block
    affinity_check=affinity_check)
  File "/usr/local/lib/python2.7/site-packages/pycalico/block.py", line 170, in auto_assign
    (self.host_affinity, affinity_id))
NoHostAffinityWarning: Host affinity is host-0282 (not host-0268)

Could the path prefix be configurable?

Currently we have an etcd structure like:

+--calico  # root namespace
   |
   |--v1
   |--ipam
   |--bgp
   |--felix
   |--libnetwork

but I want to run several different Calico networks against the same etcd server, which might look like:

+--<prefix>  # root namespace
   |
   |--calico
      |
      |--v1
      |--ipam
      |--bgp
      |--felix
      |--libnetwork

I found that the v1/ipam/bgp paths are defined in libcalico, the libnetwork path in libnetwork-plugin, and the felix path in yet another place, and it's hard to set a common prefix via an environment variable or otherwise.

Unhandled exception when using wrong etcd address

Running calicoctl with an incorrect IP address results in a stack trace showing a ConnectTimeoutError (see below).

We could fix this up by catching it in calicoctl, but perhaps it would be better to fix it in datastore and rethrow the exception as a DataStoreError?

calico@calicodev:~/calico-docker$ sudo ETCD_AUTHORITY=1.2.3.4:1234 ./calico_containers/calicoctl.py node
WARNING: Unable to detect the xt_set module. Load with modprobe xt_set
WARNING: Unable to detect the ipip module. Load with modprobe ipip
No IP provided. Using detected IP: 10.0.2.15
Unexpected error executing command.

Traceback (most recent call last):
  File "./calico_containers/calicoctl.py", line 101, in <module>
    getattr(command_module, command)(arguments)
  File "/home/calico/calico-docker/calico_containers/calico_ctl/node.py", line 164, in node
    kubernetes=arguments.get("--kubernetes"))
  File "/home/calico/calico-docker/calico_containers/calico_ctl/node.py", line 204, in node_start
    warn_if_hostname_conflict(ip)
  File "/home/calico/calico-docker/calico_containers/calico_ctl/node.py", line 391, in warn_if_hostname_conflict
    current_ipv4, _ = client().get_host_bgp_ips(hostname)
  File "/usr/local/lib/python2.7/dist-packages/pycalico/datastore.py", line 82, in wrapped
    return fn(*args, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/pycalico/datastore.py", line 229, in get_host_bgp_ips
    ipv4 = self.etcd_client.read(bgp_ipv4).value
  File "/usr/local/lib/python2.7/dist-packages/etcd/client.py", line 356, in read
    timeout=timeout)
  File "/usr/local/lib/python2.7/dist-packages/etcd/client.py", line 582, in api_execute
    preload_content=False)
  File "/usr/local/lib/python2.7/dist-packages/urllib3/request.py", line 75, in request
    **urlopen_kw)
  File "/usr/local/lib/python2.7/dist-packages/urllib3/request.py", line 88, in request_encode_url
    return self.urlopen(method, url, **urlopen_kw)
  File "/usr/local/lib/python2.7/dist-packages/urllib3/poolmanager.py", line 155, in urlopen
    response = conn.urlopen(method, u.request_uri, **kw)
  File "/usr/local/lib/python2.7/dist-packages/urllib3/connectionpool.py", line 541, in urlopen
    body=body, headers=headers)
  File "/usr/local/lib/python2.7/dist-packages/urllib3/connectionpool.py", line 370, in _make_request
    (self.host, timeout_obj.connect_timeout))
ConnectTimeoutError: (<urllib3.connectionpool.HTTPConnectionPool object at 0x7fe8b80db550>, 'Connection to 1.2.3.4 timed out. (connect timeout=60)')

ipv4 block pool without affinity host setting issue

Our system is kind of old:

calico_node-libnetwork:v0.8.0
calico_node:v0.20.0

There are some IPv4 blocks missing the affinity setting, such as this one:

/calico/ipam/v2/assignment/ipv4/block/172.20.0.0-26

# etcdctl get /calico/ipam/v2/assignment/ipv4/block/172.20.0.0-26
{"affinity": "", "strict_affinity": false, "allocations": [0, 0, 0, null, null, null, 0, null, null, null, null, null, null, 0, null, null, null, null, null, null, 0, null, null, null, null, 0, null, null, null, null, null, null, null, null, null, null, null, null, 0, 0, 0, null, null, null, null, 0, null, null, null, null, null, null, 0, null, null, null, null, null, null, 0, null, null, null, null], "unallocated": [34, 11, 32, 48, 57, 19, 47, 56, 26, 30, 3, 21, 16, 12, 8, 60, 14, 9, 18, 28, 15, 29, 42, 61, 55, 33, 54, 58, 49, 10, 44, 53, 63, 24, 4, 35, 41, 7, 22, 62, 31, 17, 46, 51, 5, 50, 23, 27, 43, 37, 36], "attributes": [{"handle_id": null, "secondary": {}}], "cidr": "172.20.0.0/26"}


This is causing an issue when allocating a Docker container inside this IP pool; below is the log from calico_node-libnetwork:

...
    block = AllocationBlock.from_etcd_result(result)
  File "/usr/lib/python2.7/site-packages/pycalico/block.py", line 134, in from_etcd_result
    assert affinity[:5] == "host:"
AssertionError
....

I'm not sure whether I can assign an affinity host (one that already has an IPv4 pool) to /calico/ipam/v2/assignment/ipv4/block/172.20.0.0-26 as a fix; will that cause chaos?

Thanks!

calicoctl uses the IP address for the cbr0 docker bridge

The calicoctl node command attempts to detect which IP it should use. Currently we ignore the docker0 bridge but otherwise just use the address of the first interface we find.

The Kubernetes docs recommend using a docker bridge called cbr0; calico won't ignore this interface and can end up using its address.

Removing an IP pool doesn't remove the IPAM affinity

@Symmetric noticed this guy.

Reproduction:

  1. Use Calico IPAM to assign an IP address - this causes a block to be associated with the given host.
  2. Delete the Calico IP pool that contains that block, add a new IP pool instead.
  3. Assign another IP address - the assigned address will be allocated from the original pool (now deleted) since IPAM still has a hold on the block.

It looks like IPAM needs to make sure the IP pool still exists before allocating an address, and choose a new pool if it doesn't.

`make ut` fails: ERROR: unsatisfiable constraints: py2-pip (missing)

The make target for unit test execution fails because the 'calico/test' image can't be built:

$ make ut
docker build -f Dockerfile.calico_test -t calico/test:latest .
Sending build context to Docker daemon 557.1 kB
Sending build context to Docker daemon 618.5 kB

Step 1 : FROM docker
---> 9b2086b6e30d
Step 2 : MAINTAINER Tom Denham [email protected]
---> Using cache
---> 0f09ac930f23
Step 3 : RUN apk add --update python python-dev py2-pip py-setuptools openssl-dev libffi-dev git musl-dev gcc tshark netcat-openbsd iptables ip6tables iproute2 iputils ipset curl && curl -o glibc.apk -L "https://github.com/andyshinn/alpine-pkg-glibc/releases/download/2.23-r1/glibc-2.23-r1.apk" && apk add --allow-untrusted glibc.apk && curl -o glibc-bin.apk -L "https://github.com/andyshinn/alpine-pkg-glibc/releases/download/2.23-r1/glibc-bin-2.23-r1.apk" && apk add --allow-untrusted glibc-bin.apk && /usr/glibc-compat/sbin/ldconfig /lib /usr/glibc/usr/lib && echo 'hosts: files mdns4_minimal [NOTFOUND=return] dns mdns4' >> /etc/nsswitch.conf && rm -f glibc.apk glibc-bin.apk && rm -rf /var/cache/apk/*
---> Running in 387115cd2ce1
fetch http://dl-cdn.alpinelinux.org/alpine/v3.4/main/x86_64/APKINDEX.tar.gz
fetch http://dl-cdn.alpinelinux.org/alpine/v3.4/community/x86_64/APKINDEX.tar.gz
ERROR: unsatisfiable constraints:
py2-pip (missing):
required by: world[py2-pip]
The command '/bin/sh -c apk add --update python python-dev py2-pip py-setuptools openssl-dev libffi-dev git musl-dev gcc tshark netcat-openbsd iptables ip6tables iproute2 iputils ipset curl && curl -o glibc.apk -L "https://github.com/andyshinn/alpine-pkg-glibc/releases/download/2.23-r1/glibc-2.23-r1.apk" && apk add --allow-untrusted glibc.apk && curl -o glibc-bin.apk -L "https://github.com/andyshinn/alpine-pkg-glibc/releases/download/2.23-r1/glibc-bin-2.23-r1.apk" && apk add --allow-untrusted glibc-bin.apk && /usr/glibc-compat/sbin/ldconfig /lib /usr/glibc/usr/lib && echo 'hosts: files mdns4_minimal [NOTFOUND=return] dns mdns4' >> /etc/nsswitch.conf && rm -f glibc.apk glibc-bin.apk && rm -rf /var/cache/apk/*' returned a non-zero code: 1
Makefile:45: recipe for target 'calico_test.created' failed
make: *** [calico_test.created] Error 1

The failure is caused by an old base image (docker), which is based on Alpine Linux v3.4 and lacks the 'py2-pip' package. The current 'latest' tag points to 3.5, which has 'py2-pip', so docker pull docker fixes the issue locally. I think the base image in Dockerfile.calico_test should be pinned to a specific version to prevent such failures in the future.

IPAM fails when `masquerade` is set to `true`

I used calicoctl pool add 10.200.0.0/16 --nat-outgoing to create a pool, then calicoctl container add 3e 10.200.0.0/16 --interface veth0.1 to add networking from the pool to container 3e, and it failed.

The error was: ValueError: Requested pool 10.200.0.0/16 is not configured or has wrong attributes.

But if I remove "masquerade": true in etcd, it works. And if I then add "masquerade": true back to etcd, it still works.

I'm running calico-node in Docker with the latest version, released about 2 days ago, and calicoctl v0.13.0.

Specifying etcd endpoints doesn't work with fqdns

I'm calling calicoctl with:

ETCD_ENDPOINTS=https://1.calico.etcd.local.:2381,https://2.calico.etcd.local.:2381,https://3.calico.etcd.local.:2381

But it aborts with:

Invalid ETCD_ENDPOINTS. Address must take the form <address>:<port>. Value
provided is '1.calico.etcd.local.:2381'

It should be possible to use fully qualified domain names.

libcalico leaks etcd.EtcdConnectionFailed

libcalico will raise EtcdConnectionFailed if it is unable to connect to etcd:

dano:~/ $ ETCD_AUTHORITY=172.25.20.13:2379 python                                                          [10:07:51]
Python 2.7.6 (default, Jun 22 2015, 17:58:13)
[GCC 4.8.2] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> from pycalico.ipam import IPAMClient
>>> datastore=IPAMClient()
>>> datastore.auto_assign_ips(1,2,"123",{},host="test")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python2.7/dist-packages/pycalico/ipam.py", line 418, in auto_assign_ips
    pool[0], host)
  File "/usr/local/lib/python2.7/dist-packages/pycalico/ipam.py", line 455, in _auto_assign
    pool)
  File "/usr/local/lib/python2.7/dist-packages/pycalico/ipam.py", line 106, in _get_affine_blocks
    result = self.etcd_client.read(path, quorum=True).children
  File "/usr/local/lib/python2.7/dist-packages/etcd/client.py", line 536, in read
    timeout=timeout)
  File "/usr/local/lib/python2.7/dist-packages/etcd/client.py", line 834, in wrapper
    cause=e
etcd.EtcdConnectionFailed: Connection to etcd failed due to MaxRetryError("HTTPConnectionPool(host='172.25.20.13', port=2379): Max retries exceeded with url: /v2/keys/calico/ipam/v2/host/test/ipv4/block/?quorum=true (Caused by <class 'socket.error'>: [Errno 111] Connection refused)",)

I propose we catch these errors and reraise them as a DatastoreConnectionError which inherits from DataStoreError.

Default ASN is Not in Private Range

It's using 64511, where the private range begins at 64512 [1].

Value as seen in Etcd:

ip-10-139-4-50 ~ # etcdctl get /calico/bgp/v1/global/as_num
64511

From the RFC:

IANA has reserved, for Private Use, a contiguous block of 1023
Autonomous System numbers from the "16-bit Autonomous System Numbers"
registry, namely 64512 - 65534 inclusive.

[1] https://tools.ietf.org/html/rfc6996

Improve perf by removing quorum reads

libcalico currently uses quorum reads on IPAM interactions.

Quorum reads are slower because they require interaction with a quorum of nodes. We could improve perf by instead tracking the index of every successful or failed write and requiring our reads to be at an index >= that value. Since the datastore object can be made stateful, we can do this without changing the libcalico API.

ST framework tidy required

A couple of areas for ST framework tidy-up:

  • Logging needs to be synchronized now that the assert-connectivity checks are multi-threaded. Probably best to implement a synchronized LoggingHandler class and register it with logging

  • The assert-connectivity processing uses a workload container, but the workload container is defined in the calico-containers repo. I think it should be moved into this repo and (provided it isn't massive) used as the default workload container.

The hostname argument for auto_assign_ips is not respected

When you call auto_assign_ips, there is an optional hostname argument (defaulting to the hostname of this server). In principle, that means you can use a name for a server in IPAM that differs from the server's actual hostname. Unfortunately, if you don't set it to the hostname of this server, it doesn't work, which makes it a bit pointless.

The underlying issue is that _auto_assign_block is called, and it raises an exception if the block in question does not match the server hostname (rather than the hostname parameter). One fix would be to remove that check, but that seems a bit crazy (it's a valid check). Another would be to pass the supplied hostname down to _auto_assign_block, which I think is probably the correct fix.

I can probably put together a PR for this, but it'd be good to get confirmation that you agree with my analysis in case I've misunderstood how to use the interface.

Here's a stack trace of what happened when I tried this (dummy-0000 is the hostname of the box, not what I passed, and this is build 0.3.0):

Traceback (most recent call last):
  File "./host-agent.py", line 532, in <module>
    main()
  File "./host-agent.py", line 374, in main
    do_ipam(name)
  File "./host-agent.py", line 342, in do_ipam
    hostname=name)
  File "/usr/local/lib/python2.7/site-packages/pycalico/ipam.py", line 409, in auto_assign_ips   
    attributes, pool[0], hostname)
  File "/usr/local/lib/python2.7/site-packages/pycalico/ipam.py", line 484, in _auto_assign
    attributes)
  File "/usr/local/lib/python2.7/site-packages/pycalico/ipam.py", line 538, in _auto_assign_block
    affinity_check=affinity_check)
  File "/usr/local/lib/python2.7/site-packages/pycalico/block.py", line 169, in auto_assign
    self.host_affinity)
pycalico.block.NoHostAffinityWarning: Host affinity is dummy-0000

calicoctl raises `NameError` due to missing import

calicoctl raises a NameError exception because EtcdAlreadyExist is not imported.

Oct  7 13:41:49 ubuntu1604 calicoctl[10428]: Traceback (most recent call last):
Oct  7 13:41:49 ubuntu1604 calicoctl[10428]:   File "startup.py", line 310, in <module>
Oct  7 13:41:49 ubuntu1604 calicoctl[10428]:     main()
Oct  7 13:41:49 ubuntu1604 calicoctl[10428]:   File "startup.py", line 288, in main
Oct  7 13:41:49 ubuntu1604 calicoctl[10428]:     client.ensure_global_config()
Oct  7 13:41:49 ubuntu1604 calicoctl[10428]:   File "/usr/lib/python2.7/site-packages/pycalico/datastore.py", line 130, in wrapped
Oct  7 13:41:49 ubuntu1604 calicoctl[10428]:     return fn(*args, **kwargs)
Oct  7 13:41:49 ubuntu1604 calicoctl[10428]:   File "/usr/lib/python2.7/site-packages/pycalico/datastore.py", line 296, in ensure_global_config
Oct  7 13:41:49 ubuntu1604 calicoctl[10428]:     self._ensure_cluster_guid(CLUSTER_GUID_PATH)
Oct  7 13:41:49 ubuntu1604 calicoctl[10428]:   File "/usr/lib/python2.7/site-packages/pycalico/datastore.py", line 314, in _ensure_cluster_guid
Oct  7 13:41:49 ubuntu1604 calicoctl[10428]:     except EtcdAlreadyExist:
Oct  7 13:41:49 ubuntu1604 calicoctl[10428]: NameError: global name 'EtcdAlreadyExist' is not defined

implement adjustable timeouts

libcalico does not support adjustable timeouts, and allows certain calls to etcd to block for as long as 60 seconds. For example:

dano:~/ $ ETCD_AUTHORITY=10.0.0.200:2379 python                                                             [9:49:27]
Python 2.7.6 (default, Jun 22 2015, 17:58:13)
[GCC 4.8.2] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> from pycalico.ipam import IPAMClient
>>> datastore=IPAMClient()
>>> datastore.auto_assign_ips(1,2,"123",{},host="test")

Python blocks for 60 seconds before emitting the following message:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python2.7/dist-packages/pycalico/ipam.py", line 418, in auto_assign_ips
    pool[0], host)
  File "/usr/local/lib/python2.7/dist-packages/pycalico/ipam.py", line 455, in _auto_assign
    pool)
  File "/usr/local/lib/python2.7/dist-packages/pycalico/ipam.py", line 106, in _get_affine_blocks
    result = self.etcd_client.read(path, quorum=True).children
  File "/usr/local/lib/python2.7/dist-packages/etcd/client.py", line 536, in read
    timeout=timeout)
  File "/usr/local/lib/python2.7/dist-packages/etcd/client.py", line 834, in wrapper
    cause=e
etcd.EtcdConnectionFailed: Connection to etcd failed due to ConnectTimeoutError(<urllib3.connectionpool.HTTPConnectionPool object at 0x7f0d616604d0>, 'Connection to 10.0.0.200 timed out. (connect timeout=60)')

60 seconds of block time is unacceptable to Mesos, which begins to reinitialize tasks if they are staged for that long. We should implement a default or parameterized timeout so that etcd calls can be aborted.

Allow specification of tags when creating new profile

Similar to how rules are implemented, so that we don't have to make multiple libcalico calls to configure a profile with tags:

def create_profile(self, name, rules=None):

becomes

def create_profile(self, name, rules=None, tags=None):

Incorrect calculation of length in block.py

There are a couple of bugs in block.py which seem to cancel each other out, but they could be tidied up (I found them very confusing when reading the code).

PREFIX_MASK = {4: (IPAddress("255.255.255.255") ^ (BLOCK_SIZE - 1)),
               6: (IPAddress("ffff:ffff:ffff:ffff:ffff:ffff:ffff:ffff") ^
                   (BLOCK_SIZE - 1))}

Sticking to IPv4, and with a BLOCK_SIZE of 8, that leaves you with a mask of 255.255.255.248 when you really want 255.255.255.0. test_get_block_cidr still passes though: it converts 10.34.11.75 to 10.34.11.72/24, which is equal to 10.34.11.0/24 as an IPNetwork, even though it probably isn't what you want.

The test should be fixed to compare against a string rather than against an IPNetwork, to ensure that you really are getting what you want.

Invalid mac address returned if `ip` prints to stderr

The function get_ns_veth_mac (netns.py#L245) returns an invalid result if the ip netns ... command printed a message to stderr.

The ip netns command can print RTNETLINK answers: Invalid argument to stderr and still finish with exit code 0, which causes the following endpoint to be written to etcd:

{"name": "cali76e50b0684a", "labels": {}, "state": "active", "ipv6_nets": [], "mac": "RTNETLINK answers: Invalid argument\n2a:1b:78:a8:63:a4", "ipv4_nets": ["10.233.67.164/32"], "profile_ids": ["calico-k8s-network"]}

The root cause is that the function NamedNamespace.check_output redirects stderr to stdout.

The solution is to read output only from stdout.

Rule objects don't validate parameters

When creating a libcalico rule, only a minimal amount of validation is performed. It would be significantly easier to debug policy problems if the given arguments (src_net, dst_tag, etc.) were validated before being programmed into etcd, so that errors in policy could be detected before being picked up by Felix.
