nixos / infra
NixOS configurations for nixos.org and its servers
License: MIT License
Right now there are a few groups of managed build nodes. This makes it difficult to ensure they're all running consistent versions of Nix and NixOS, and are participating in the monitoring.
Import the x86 Linux machines into the network:
Import the macOS-on-Linux machines into the network:
Convert the remaining macOS-on-Darwin machines to macOS-on-Linux and import them into the network:
Import the aarch64-linux machines into the network:
I noticed that LLVM wasn't available in the binary cache, but built fine on my machine. It looks like the build machine is running out of memory:
g++: fatal error: Killed signal terminated program cc1plus
I noticed this because Anki fails to build in the nixpkgs-20.03-darwin jobset with OSError: [Errno 24] Too many open files.
The same derivation can build locally with a higher limit. There's a sample LaunchDaemon to configure this here: https://eradman.com/entrproject/limits.html
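Before baking a limit into a LaunchDaemon, the effective limit is easy to inspect and adjust from a shell; a quick sketch (the 1024 value is illustrative, not a recommendation):

```shell
# Inspect the current limits on open file descriptors (values vary by
# platform; the stock macOS soft limit is famously low, often 256).
ulimit -Sn   # soft limit
ulimit -Hn   # hard limit

# Set the soft limit for this shell and its children; anything up to the
# hard limit is allowed without root:
ulimit -Sn 1024
ulimit -Sn
```

The soft limit set this way only applies to the current shell and its children, which is why a persistent LaunchDaemon-level setting is needed for the build services.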
I'm not sure what the consequences of this are for virtualization (e.g. does the guest use the host's file limit?).
cc @grahamc because you wrote the README for the macOS infrastructure 🙏
I was trying to figure out why nixpkgs-unstable wasn't updating for a few days, and stumbled upon this page: https://hydra.nixos.org/build/139583183
It's filled with a bunch of messages like:
--- Error --- hydra-queue-runner: cannot connect to '[email protected]':
@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
@ WARNING: POSSIBLE DNS SPOOFING DETECTED! @
@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
The ED25519 host key for c5517495.packethost.net has changed, and the key for the corresponding IP address 2604:1380:2001:2000::d is unknown. This could either mean that DNS SPOOFING is happening or the IP address for the host and its host key have changed at the same time.
@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
@ WARNING: REMOTE HOST IDENTIFICATION HAS CHANGED! @
@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
IT IS POSSIBLE THAT SOMEONE IS DOING SOMETHING NASTY! Someone could be eavesdropping on you right now (man-in-the-middle attack)! It is also possible that a host key has just been changed. The fingerprint for the ED25519 key sent by the remote host is SHA256:MEs3I20z6zLDCoRXxDwb41ivxoR+o1a+O5HHE6t6dmc. Please contact your system administrator. Add correct host key in /tmp/nix-29595-575205/host-key to get rid of this message. Offending ED25519 key in /tmp/nix-29595-575205/host-key:1
ED25519 host key for c5517495.packethost.net has changed and you have requested strict checking. Host key verification failed.
It also has some successful builds, so maybe it's a transient issue.
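If the key change turns out to be legitimate (e.g. the machine was reprovisioned), the stale entry just needs to be removed from the known-hosts file. A self-contained sketch using a throwaway key and a scratch file, since the /tmp paths in the log above are transient:

```shell
# Build a scratch known_hosts with one entry, then remove it the way the
# offending entry above would be removed (key and paths here are made up).
kh=$(mktemp -d)
ssh-keygen -t ed25519 -N '' -f "$kh/id" >/dev/null        # throwaway key pair
printf 'c5517495.packethost.net %s\n' "$(cut -d' ' -f1-2 "$kh/id.pub")" > "$kh/known_hosts"

# -R removes every entry matching the hostname and backs up the old file:
ssh-keygen -R c5517495.packethost.net -f "$kh/known_hosts"
grep -q packethost "$kh/known_hosts" || echo "entry removed"
```

For the queue runner the relevant file would be whatever Nix passes as the host-key file for that builder, not the user's ~/.ssh/known_hosts.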
Creating this issue based on this Discourse thread.
Ike has been failing builds transiently a few times now, including in the middle of a webkit build that didn't use network access, so the issue appears to be RAM- or disk-related.
And this doesn't sound good: if the error occurs in e.g. the final build product, the compilation step may actually pass and generate a broken binary.
In addition, these failing builds slow down the channel, as they need to be manually restarted for the channel to move forward.
I understand that ike is currently one of the most powerful machines in Hydra, but maybe it'd make sense to disable it and see whether the channel actually bumps faster (as there should be fewer transient failures)?
This breaks the nix log command. Example:
$ curl https://cache.nixos.org/7k3vlmpvzjqrc7lbmz1csbhg5d1rn4fw.narinfo
StorePath: /nix/store/7k3vlmpvzjqrc7lbmz1csbhg5d1rn4fw-glibc-2.31
URL: nar/1236ysxagw0gv01cw4i4dsbm257lzpq8mxkpjzmi6pg7p0nqb0ac.nar.xz
Compression: xz
FileHash: sha256:1236ysxagw0gv01cw4i4dsbm257lzpq8mxkpjzmi6pg7p0nqb0ac
FileSize: 6401092
NarHash: sha256:12k2hciiibf4x68957x7234kbykw0szpq9gidn10qvpxckm7rm0l
NarSize: 30519536
References: 7k3vlmpvzjqrc7lbmz1csbhg5d1rn4fw-glibc-2.31 z1sxk8d5z9cn89pv46h800lkqjl22g67-libidn2-2.3.0
Sig: cache.nixos.org-1:+Vr1hG9RWRRZrBlI2u/ZPA7sfMkF2PISLyiP/sa+zMrjm1TjxDVqkCrh48nWLTipYISPg3QPL5LjL23xi9kNAQ==
whereas previously:
$ curl https://cache.nixos.org/5ka41zhii1bjss3f60rzd2npz9mxj060.narinfo
StorePath: /nix/store/5ka41zhii1bjss3f60rzd2npz9mxj060-glibc-2.27
URL: nar/0rjjy0l8a5z7ajk4zprvnwqbmcgaqlnpsd2v80kdb3frlw9gkgh8.nar.xz
Compression: xz
FileHash: sha256:0rjjy0l8a5z7ajk4zprvnwqbmcgaqlnpsd2v80kdb3frlw9gkgh8
FileSize: 6152544
NarHash: sha256:1wj2wa8hkb792q9qp6rck87315jldixjp7kaw1rz0r7sh09kwblx
NarSize: 27158944
References: 5ka41zhii1bjss3f60rzd2npz9mxj060-glibc-2.27
Deriver: nq8gcjq461ijxj2s28xjikyjan6xyq90-glibc-2.27.drv
Sig: cache.nixos.org-1:1J0Aafo5T+O/W032Tz1t/ql7UO1fOCu3PWHpwS6KGATWN7nmn9DJXo9esu1ir7Nnfii7cuZ1xpZblQ0IElhdDg==
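The visible difference is that the new narinfo lacks the Deriver field, which nix log uses to locate the build log. A self-contained sketch of checking for it (normally you'd pipe in `curl -s <narinfo URL>`; the inlined sample mimics the new, Deriver-less response):

```shell
# Check whether a .narinfo carries a Deriver field. Sample text is inlined
# so the snippet runs without network access.
narinfo='StorePath: /nix/store/7k3vlmpvzjqrc7lbmz1csbhg5d1rn4fw-glibc-2.31
Compression: xz
NarSize: 30519536'

if printf '%s\n' "$narinfo" | grep -q '^Deriver:'; then
  echo "has Deriver"
else
  echo "no Deriver"   # this sample, like the new cache responses, has none
fi
```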
@rbvermaa are you handling this service? http://nixos.org/irc/logs/
The IRC logs no longer work; the last one was in July.
The user mog on #nixos has logs for that time period.
Apparently for some machines the scheduler thinks they have the "big-parallel" feature but the builder itself disagrees. We're then getting aborted builds like:
Aborted: error: --- Error --- nix-store: a 'aarch64-linux' with features {big-parallel} is required to build '/nix/store/kjvs406gd1gxa90fz614krnasv3p8wym-llvm-9.0.1.drv', but I am a 'aarch64-linux' with features {kvm, nixos-test, recursive-nix}
The current case was aarch64, but I may have seen an x86_64 one a couple of days ago.
Here is a rough sorting of domains to drop or keep, based on which ones just redirected back to nixos.org, which I'd never heard of before, or which were showing a "website maintenance" page, plus a couple I have heard of but nobody uses:
Drop:
ts.nixos.org
svn.nixos.org
status.nixos.org * never updated since I came around ...
stan.nixos.org
releases.nixos.org * I wonder how much traffic this gets
mturk.nixos.org
monitor.nixos.org
lucifer.nixos.org
losser.nixos.org
barbrady.nixos.org
Keep:
cache.nixos.org
hydra.nixos.org
weekly.nixos.org
conf.nixos.org
tarballs.nixos.org
planet.nixos.org
Domain list from #33 (comment)
This will allow us to get rid of the EC2 web server.
Despite services.nginx.virtualHosts."monitoring.nixos.org".enableACME being set, the certificate expired yesterday (2021-03-18). This has a knock-on effect: status.nixos.org no longer works (see also NixOS/nixos-status#9).
I'm sure @zimbatm would like to handle this one :)
Dec 20 03:22:24 webserver systemd[1]: Started Update Channel nixos-18.09.
Dec 20 03:22:29 webserver update-nixos-18.09-start[9256]: release is ‘nixos-18.09.1761.9bacb8289bb’ (build 86094175), eval is 1496550, prefix is nixos/18.09/nixos-18.09.1761.9bacb8289bb, Git commit is 9bacb8289bbd401988d94aacea83efbe225ebc1a
Dec 20 03:22:29 webserver update-nixos-18.09-start[9256]: Net::Amazon::S3: Amazon responded with 403 Forbidden
Dec 20 03:22:29 webserver update-nixos-18.09-start[9256]: at /nix/store/5ajpkdh6byxjfayn97q2xksdii71w5m6-perl-Net-Amazon-S3-0.80/lib/perl5/site_perl/5.24.3/Net/Amazon/S3/Bucket.pm line 151.
Dec 20 03:22:29 webserver systemd[1]: update-nixos-18.09.service: Main process exited, code=exited, status=255/n/a
Dec 20 03:22:29 webserver systemd[1]: update-nixos-18.09.service: Unit entered failed state.
Dec 20 03:22:29 webserver systemd[1]: update-nixos-18.09.service: Failed with result 'exit-code'.
# systemctl status mirror-tarballs | cat
● mirror-tarballs.service - Mirror Nixpkgs Tarballs
Loaded: loaded (/nix/store/184y8myyyw134ri16d844ccwm17wxfkg-unit-mirror-tarballs.service/mirror-tarballs.service; linked; vendor preset: enabled)
Active: failed (Result: exit-code) since Tue 2019-01-22 05:31:17 CET; 5h 42min ago
Process: 2448 ExecStart=/nix/store/nrv6r883198m3s9cf49v4xjl5haclcwc-unit-script-mirror-tarballs-start (code=exited, status=1/FAILURE)
Main PID: 2448 (code=exited, status=1/FAILURE)
Jan 22 05:31:13 bastion nrv6r883198m3s9cf49v4xjl5haclcwc-unit-script-mirror-tarballs-start[2448]: trace: stdenv.isArm is deprecated after 18.03
Jan 22 05:31:13 bastion nrv6r883198m3s9cf49v4xjl5haclcwc-unit-script-mirror-tarballs-start[2448]: trace: stdenv.isArm is deprecated after 18.03
Jan 22 05:31:17 bastion nrv6r883198m3s9cf49v4xjl5haclcwc-unit-script-mirror-tarballs-start[2448]: GC Warning: Failed to expand heap by 8388608 bytes
Jan 22 05:31:17 bastion nrv6r883198m3s9cf49v4xjl5haclcwc-unit-script-mirror-tarballs-start[2448]: GC Warning: Failed to expand heap by 65536 bytes
Jan 22 05:31:17 bastion nrv6r883198m3s9cf49v4xjl5haclcwc-unit-script-mirror-tarballs-start[2448]: GC Warning: Out of Memory! Heap size: 3690 MiB. Returning NULL!
Jan 22 05:31:17 bastion nrv6r883198m3s9cf49v4xjl5haclcwc-unit-script-mirror-tarballs-start[2448]: error: out of memory
Jan 22 05:31:17 bastion nrv6r883198m3s9cf49v4xjl5haclcwc-unit-script-mirror-tarballs-start[2448]: ./maintainers/scripts/copy-tarballs.pl: evaluation failed
Jan 22 05:31:17 bastion systemd[1]: mirror-tarballs.service: Main process exited, code=exited, status=1/FAILURE
Jan 22 05:31:17 bastion systemd[1]: mirror-tarballs.service: Failed with result 'exit-code'.
Jan 22 05:31:17 bastion systemd[1]: Failed to start Mirror Nixpkgs Tarballs.
In the event that all our infrastructure goes down, we want to be notified as soon as possible.
It could be that our domain has expired.
It could be that one of our partners decided to pull the plug.
It could be that we got hacked really bad.
In any case, it would be nice to receive an SMS if that happened.
It would be easier to figure out and fix issues like nixos-homepage#232 and #41 with log access.
Monitoring the website build, even just for a simple "hey, look, I failed", would at least let us act more quickly on figuring out a solution. A month is kinda slow :/.
(I have no actual knowledge or propositions for this.)
Instead, let's consider running them sequentially. The memory pressure on the instance is too high for it to get two done at once.
An issue to centralize the problems some users are having with cache.nixos.org after the switch to Fastly.
Oct 20 13:17:45 ceres hydra-queue-runner[10314]: possibly transient failure building ‘/nix/store/gyq9h07h6iph8fdy9pa873w720zwkcm0-ghc-8.11.20200824.drv’ on ‘root@mac1-guest’: error: --- Error --- hydra-queue-runner
Oct 20 13:17:45 ceres hydra-queue-runner[10314]: cannot connect to ‘root@mac1-guest’: [email protected]: Permission denied (publickey).
Oct 20 13:18:47 ceres hydra-queue-runner[10314]: possibly transient failure building ‘/nix/store/gyq9h07h6iph8fdy9pa873w720zwkcm0-ghc-8.11.20200824.drv’ on ‘root@mac9-guest’: error: --- Error --- hydra-queue-runner
Oct 20 13:18:47 ceres hydra-queue-runner[10314]: cannot connect to ‘root@mac9-guest’: kex_exchange_identification: Connection closed by remote host
Nix config is missing for EC2-provisioned machines; I don't see that they include delft/common.nix, so settings like build-cores = 0 aren't applied. This is why chromium takes 8h to compile when in reality it should take 40-60 min (plus the load).
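For illustration, a minimal fragment of what a shared config could set, using the stock NixOS options (values are illustrative, not taken from the real delft/common.nix):

```nix
{
  # 0 means "use all available cores" for each individual build.
  nix.buildCores = 0;
  # How many builds may run concurrently; tune per EC2 instance size.
  nix.maxJobs = 4;
}
```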
When I type nixos.org in the browser URL bar, it doesn't redirect to https://nixos.org. I think this should be done at least for the http://nixos.org/nixos/security.html page.
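Assuming the site is served by the NixOS nginx module, the redirect is a one-line option; a sketch (the real vhost definition may differ):

```nix
{
  services.nginx.virtualHosts."nixos.org" = {
    enableACME = true;
    # forceSSL emits a 301 redirect for every http:// request to https://.
    forceSSL = true;
  };
}
```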
Now that 20.03 is out (congrats!), it would be great to have official AMIs out there as well, otherwise this page looks broken:
https://nixos.org/download.html (tab Amazon EC2).
In this discussion on Discourse @grahamc mentions that maybe the release documentation needs updating to include building AMIs as part of the release.
See NixOS/nix#75 (comment): Hydra already signs packages, so the cache would need to copy the signatures and create signatures for already-built packages.
According to this document https://aws.amazon.com/blogs/aws/aws-ipv6-update-global-support-spanning-15-regions-multiple-aws-services/ there is IPv6 support in eu-west-1, where the website is apparently hosted. Once the homepage has IPv6 support as well, IPv6 support will be complete.
$ dig AAAA nixos.org
; <<>> DiG 9.10.4-P6 <<>> AAAA nixos.org
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 27168
;; flags: qr rd ra; QUERY: 1, ANSWER: 0, AUTHORITY: 0, ADDITIONAL: 1
;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 4096
;; QUESTION SECTION:
;nixos.org. IN AAAA
;; Query time: 0 msec
;; SERVER: 127.0.0.1#53(127.0.0.1)
;; WHEN: Tue Apr 25 10:03:41 CEST 2017
;; MSG SIZE rcvd: 38
Currently the certificate for cache.nixos.org is not recognized by some browsers and is provisioned manually, so it would be better to use ACM.
To make it as easy as possible for users to connect to the community, it would be great if nixos.org served a web interface, with history, for the IRC channel, so people can use IRC without a client.
A good candidate seems to be https://github.com/ircanywhere/ircanywhere.
The region is in the AMI upload script but not in the AMI list https://github.com/NixOS/nixpkgs/blob/master/nixos/modules/virtualisation/ec2-amis.nix.
I think that's achieved by flipping metric.current in Prometheus: https://github.com/NixOS/nixos-org-configurations/blob/3294705/delft/eris/status-page/status.js#L120
Currently the page displays "loading from Prometheus" forever, with the Prometheus API redirecting to 0.0.0.0.
(@grahamc is already aware, just filing this as an issue as well)
It seems that something along the path is dropping ICMP "Packet Too Big" messages, breaking path MTU discovery:
$ ping -6 nixos.org -s 1444
PING nixos.org(2a05:d014:275:cb00:ec0d:12e2:df27:aa60 (2a05:d014:275:cb00:ec0d:12e2:df27:aa60)) 1444 data bytes
1452 bytes from 2a05:d014:275:cb00:ec0d:12e2:df27:aa60 (2a05:d014:275:cb00:ec0d:12e2:df27:aa60): icmp_seq=1 ttl=48 time=16.1 ms
^C
$ ping -6 nixos.org -s 1445
PING nixos.org(2a05:d014:275:cb00:ec0d:12e2:df27:aa60 (2a05:d014:275:cb00:ec0d:12e2:df27:aa60)) 1445 data bytes
^C
Here's a tracepath:
1?: [LOCALHOST] 0.005ms pmtu 1500
1: redacted.dip.versatel-1u1.de 0.660ms
1: redacted.dip.versatel-1u1.de 0.437ms
2: redacted.dip.versatel-1u1.de 0.379ms pmtu 1492
2: redacted 1.384ms
3: redacted 8.840ms
4: 2001:1438:0:1::5:1f2 9.661ms
5: 2001:1438:0:1::5:72 14.331ms
6: fra1-edge1.digitalocean.com 15.400ms
7: 2604:a880:ffff:5::43a 35.559ms
8: no reply
9: no reply
10: no reply
11: no reply
Here's hydra.nixos.org for comparison:
1?: [LOCALHOST] 0.013ms pmtu 1500
1: redacted.dip.versatel-1u1.de 0.486ms
1: redacted.dip.versatel-1u1.de 2.604ms
2: redacted.dip.versatel-1u1.de 0.371ms pmtu 1492
2: redacted 1.084ms
3: redacted 14.956ms
4: 2001:1438:0:1::5:1f2 7.794ms
5: 2001:1438:0:1::5:72 14.621ms
6: 2a01:4f8:0:e0f0::29 14.935ms
7: core1.fra.hetzner.com 20.925ms
8: core23.fsn1.hetzner.com 20.273ms
9: ex9k1.dc5.fsn1.hetzner.com 19.310ms
10: 2a01:4f8:140:244c:: 19.098ms reached
Resume: pmtu 1492 hops 10 back 10
To do this, we should:
and then, in order:
As recently discussed with @vcunat and others on IRC, I think the Hydra email notifications are very valuable and should be re-enabled. They were disabled in March after some spam issues. However, without them it's hard for a maintainer to actually maintain a package (e.g. react to failures).
IRC logs:
12:13 <timokau[m]> Whats the status of hydra emailing maintainers? Was that just never re-activated after the spam? Or was the issue never fixed?
12:23 <vcunat> Never reactivated AFAIK.
12:24 <ekleog> oh :/
12:25 <timokau[m]> Can we just do that, or is there some work involved?
12:27 <vcunat> Here's the line
12:27 <vcunat> https://github.com/NixOS/nixos-org-configurations/blame/1dfde8a7cc461cacf338250a8a2eb53b0e0bd72c/delft/chef.nix#L54
12:27 <vcunat> I don't know how Hydra's mail works on the inside. (e.g. if it will try to re-send those mails or something)
12:27 <vcunat> niksnut: ^^
12:39 <niksnut> yeah it's disabled
12:41 <vcunat> and expected not to cause trouble if simply re-enabled?
12:42 <vcunat> My guess would be that the problem was that the *first* evaluation happenned with the feature on. On subsequent evaluations I'd expect only status changes would be e-mailed, but I might easily be wrong.
12:48 <niksnut> IMHO email notification is not really worth it
12:48 <niksnut> it causes more problems than it's worth, and most users don't care for it
13:10 <LnL> can people without an account access the maintainers page on hydra?
13:11 <LnL> doesn't look like it https://hydra.nixos.org/dashboard/[email protected]#tabs-my-jobs
13:13 <LnL> also not everything is in there
13:16 <LnL> oh, meta.maintainers is broken on hyra Maintainer(s):not given
13:16 <LnL> niksnut: ^
13:36 <timokau[m]> vcunat: niksnut: I think email notifications are very much worth it. How else are maintainers supposed to notice that their packages break? I think it is a very important step in minimizing hydra failures.
13:40 <vcunat> timokau[m]: yes, I don't know a better way ATM.
13:40 <vcunat> Most maintainers didn't react to the e-mails apparently, but if working reasonably reliably, the feature would seem a nice to have.
13:44 <timokau[m]> Yes and it was working reliably until the spam. If that is still a concern, maybe some stupid rate limiting would reduce the risk. Or worst case we could at least make it possible to opt-in.
13:44 <vcunat> I occasionally did get some weird messages for it for builds that were months old.
13:44 <vcunat> s/for it/from it/
13:47 <aminechikhaoui> vcunat: I saw that also in our private hydra, I think it has to do with the attempted fix here but not sure https://github.com/NixOS/hydra/pull/566
13:48 <aminechikhaoui> but it basically happens every time we restart the queue runner
13:48 <vcunat> well first we need to fix filling the maintainer colon, as without that data there won't be anyone to send to
13:49 <vcunat> (except for those messages: "your commit may have broken this build")
13:50 <vcunat> Eh, not "colon", but I guess you know what I mean :-)
13:51 <timokau[m]> In my opinion a few false-positives would be better than no positives at all :)
13:52 <timokau[m]> I didn't know there was also that kind of message. Aren't that usually a lot of commits?
14:14 <vcunat> It certainly happened commonly that there were many.
14:14 <vcunat> > This may be due to 640 commits by ... (long list of authors)
14:19 <timokau[m]> Those messages should probably be disabled. Or even better only sent if up to X commits might be responsible.
14:32 <timokau[m]> And the problem with maintainers is that the parsing was just never adapted to the new maintainers format?
14:54 <vcunat> It's possible. I don't know if anything was attempted.
We have a bunch of packages that are failing to build due to missing SSE 4.2 instruction set on wendy.
There's an issue NixOS/nixpkgs#115425 that's waiting on a Nix release, but that's going to take a while to be usable.
Since wendy is supposed to be retired later this year anyway, and given that it doesn't play a huge role in the Linux workload, I'd suggest retiring it now.
The alternative is to make it run only jobs with the big-parallel and tests features.
Note that this is currently a big blocker for data science packages, and it causes frustration as packages break all the time.
I'm happy to do the work myself if I can help in any way.
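If the feature-restriction alternative is preferred, here is a sketch of the relevant build-machine entry using the standard nix.buildMachines options (the numbers are illustrative, not taken from the real config):

```nix
{
  nix.buildMachines = [{
    hostName = "wendy";
    system = "x86_64-linux";
    maxJobs = 8;   # illustrative
    supportedFeatures = [ "big-parallel" "nixos-test" ];
    # To keep ordinary jobs off the machine entirely, mandatoryFeatures can
    # also be set; a job is only dispatched here if it requests all of them.
  }];
}
```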
After #68, this is the new list of name servers to switch to at united domains:
ns-1455.awsdns-53.org
ns-1875.awsdns-42.co.uk
ns-483.awsdns-60.com
ns-998.awsdns-60.net
There is the DNS server config, and there should be a separate place where just the nameservers can be switched away from the registrar-hosted DNS.
From the hydra.nixos.org logs:
Jul 28 13:09:44 ceres hydra-queue-runner[9427]: possibly transient failure building ‘/nix/store/4k1353pypg9jzq0nsmchq7w3rl0s2bg9-nixpkgs-metrics.drv’ on ‘[email protected]’: error: --- Error --- hydra-queue-runner
Jul 28 13:09:44 ceres hydra-queue-runner[9427]: cannot connect to ‘[email protected]’: ssh: connect to host t2a.cunat.cz port 22: No route to host
@vcunat Do you know what's up with this machine?
Those are managed by Terraform now.
We have started to use "Repology" data in Nixpkgs through nix-update. Repology takes the .json file provided by nixos.org and finds outdated packages. It updates hourly.
This works well, but the data is frequently out of date:
$ curl -I https://nixos.org/nixpkgs/packages-unstable.json.gz
HTTP/1.1 200 OK
Date: Sun, 25 Mar 2018 03:22:34 GMT
Server: Apache/2.4.29 (Unix) OpenSSL/1.0.2n
Strict-Transport-Security: max-age=15552000
Last-Modified: Wed, 07 Mar 2018 16:00:44 GMT
ETag: "13c573-566d4a9650dba"
Accept-Ranges: bytes
Content-Length: 1295731
Content-Type: application/json
Content-Encoding: x-gzip
Can we set this to update more frequently? It's currently more than 2 weeks old, giving us bad data from Repology.
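The .json generation appears to run as a periodic job; as a sketch, pinning it to an hourly schedule with a NixOS systemd timer would look roughly like this (the unit name and script are made up, not the real ones):

```nix
{
  systemd.services.generate-packages-json = {
    serviceConfig.Type = "oneshot";
    # Placeholder; the real job runs whatever produces packages-unstable.json.
    script = "echo 'regenerate packages-unstable.json here'";
  };
  systemd.timers.generate-packages-json = {
    wantedBy = [ "timers.target" ];
    timerConfig.OnCalendar = "hourly";   # instead of every couple of weeks
  };
}
```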
Channel releases happen via systemd timer units.
It would be good to move that task away from the nixos.org webserver, just in case nginx has an exploitable security hole.
Another reason is that channel releases can OOM, which could kill the webserver unnecessarily. If a channel release fails, it's retried anyway.
As introduced in nix-community/nixops-gce#1, we can switch from creating our own bootstrap image per deployment to having a public image available in the NixOS GCP account, similar to the AMIs in EC2.
We then need to update the https://github.com/NixOS/nixpkgs/blob/master/nixos/modules/virtualisation/gce-images.nix file, specifying the image family and the project name.
Here are the steps:
Building an image from source
$ gcloud compute images create nixos-18091228a4c4cbb613c-x86-64-linux \
--source-uri gs://nixos-cloud-images/nixos-image-18.09.1228.a4c4cbb613c-x86_64-linux.raw.tar.gz \
--family=nixos-1809
Making the image public
$ gcloud compute images add-iam-policy-binding nixos-18091228a4c4cbb613c-x86-64-linux \
--member='allAuthenticatedUsers' \
--role='roles/compute.imageUser'
That image may then be used publicly, so nixops users won't need to provision their own 'bootstrap-image' resource for every deployment:
$ gcloud compute instances create test-nixos-18 \
--image-family=nixos-1809 \
--zone=europe-west1-c \
--image-project=predictix-operations
Hi,
I wanted to download an image for my rpi3 from Hydra and noticed that Hydra seems to be having some issues.
According to the Grafana board, all services are failing.
# journalctl -u update-nixos-20.09-small.service
....
Oct 05 11:40:16 bastion update-nixos-20.09-small-start[18899]: $ index-debuginfo /scratch/hydra-mirror/nixos-files.sqlite s3://nix-cache /scratch/hydra-mirror/release-nixos-20.09-small/nixos-20.09beta977.ad3a5d5092e/store-paths
Oct 05 11:40:16 bastion update-nixos-20.09-small-start[27673]: error: --- Error --- index-debuginfo
Oct 05 11:40:16 bastion update-nixos-20.09-small-start[27673]: don't know how to open Nix store 's3://nix-cache'
Oct 05 11:40:16 bastion update-nixos-20.09-small-start[18899]: Command failed with code (1) errno (0).
Tomorrow at 14:00 America/New_York we'll be migrating Hydra's database from PostgreSQL 11 to PostgreSQL 12.
Motivation: let us use log_transaction_sample_rate and other improvements in PostgreSQL 12. Once we upgrade to 21.05, I'll want to get us to 13 soon after, to let us use sample-based slow query logging.
How to:
systemctl stop postgresql.service
zfs snapshot rpool/safe/postgres@postgres-11-to-12-migration-pre
diff --git a/delft/haumea.nix b/delft/haumea.nix
index b94676f..ece5b8d 100644
--- a/delft/haumea.nix
+++ b/delft/haumea.nix
@@ -84,7 +84,7 @@
services.postgresql = {
enable = true;
- package = pkgs.postgresql_11;
+ package = pkgs.postgresql_12;
dataDir = "/var/db/postgresql";
# https://pgtune.leopard.in.ua/#/
settings = {
nixops deploy -d buildfarm --include haumea --dry-activate
oldpg=$(nix-build -I nixpkgs=channel:nixos-20.09-small -E '(import <nixpkgs> {}).postgresql_11')
newpg=$(nix-build -I nixpkgs=channel:nixos-20.09-small -E '(import <nixpkgs> {}).postgresql_12')
cd /var/db/postgresql
mkdir old
chmod 0700 old
mv ./* old || true
mkdir new
"$newpg/bin/initdb" -U root --locale=en_US.UTF-8 --encoding UTF8 ./new
"${newpg}/bin/pg_upgrade" \
--old-bindir="${oldpg}/bin/" \
--new-bindir="${newpg}/bin/" \
--old-datadir "./old" \
--new-datadir "./new" \
--link \
--user root \
--verbose
rm -rf old
mv ./new/* .
zfs snapshot rpool/safe/postgres@postgres-11-to-12-migration-post
nixops deploy -d buildfarm --include haumea
Postgresql should be started now.
The --link option uses hardlinks to copy the data files, and only system tables are rewritten during the migration.
To roll back: zfs rollback rpool/safe/postgres@postgres-11-to-12-migration-pre
Here is the result of running these steps on a clone of production data:
[grahamc@kif:/hydra/scratch/haumea-hack/target]$ oldpg=$(nix-build -I nixpkgs=channel:nixos-20.09-small -E '(import <nixpkgs> {}).postgresql_11')
[grahamc@kif:/hydra/scratch/haumea-hack/target]$ newpg=$(nix-build -I nixpkgs=channel:nixos-20.09-small -E '(import <nixpkgs> {}).postgresql_12')
[grahamc@kif:/hydra/scratch/haumea-hack/target]$ mkdir old
[grahamc@kif:/hydra/scratch/haumea-hack/target]$ chmod 0700 old
[grahamc@kif:/hydra/scratch/haumea-hack/target]$ mv ./* old || true
mv: cannot move './old' to a subdirectory of itself, 'old/old'
[grahamc@kif:/hydra/scratch/haumea-hack/target]$
[grahamc@kif:/hydra/scratch/haumea-hack/target]$ mkdir new
[grahamc@kif:/hydra/scratch/haumea-hack/target]$ "$newpg/bin/initdb" -U root ./new
The files belonging to this database system will be owned by user "grahamc".
This user must also own the server process.
The database cluster will be initialized with locale "en_US.UTF-8".
The default database encoding has accordingly been set to "UTF8".
The default text search configuration will be set to "english".
Data page checksums are disabled.
fixing permissions on existing directory ./new ... ok
creating subdirectories ... ok
selecting dynamic shared memory implementation ... posix
selecting default max_connections ... 100
selecting default shared_buffers ... 128MB
selecting default time zone ... UTC
creating configuration files ... ok
running bootstrap script ... ok
performing post-bootstrap initialization ... ok
syncing data to disk ... ok
initdb: warning: enabling "trust" authentication for local connections
You can change this by editing pg_hba.conf or using the option -A, or
--auth-local and --auth-host, the next time you run initdb.
Success. You can now start the database server using:
/nix/store/140ag1560jjjqli2daz0d7cwwxbsa4ra-postgresql-12.6/bin/pg_ctl -D ./new -l logfile start
[grahamc@kif:/hydra/scratch/haumea-hack/target]$ "${newpg}/bin/pg_upgrade" \
> --old-bindir="${oldpg}/bin/" \
> --new-bindir="${newpg}/bin/" \
> --old-datadir "./old" \
> --new-datadir "./new" \
> --link \
> --user root \
> --verbose
Running in verbose mode
Performing Consistency Checks
-----------------------------
Checking cluster versions ok
Current pg_control values:
[...]
Values to be changed:
First log segment after reset: 000000010000166F000000AD
[...]
Values to be changed:
First log segment after reset: 000000010000000000000002
[... a lot of linking and queries ...]
Upgrade Complete
----------------
Optimizer statistics are not transferred by pg_upgrade so,
once you start the new server, consider running:
./analyze_new_cluster.sh
Running this script will delete the old cluster's data files:
./delete_old_cluster.sh
[grahamc@kif:/hydra/scratch/haumea-hack/target]$ ${newpg}/bin/pg_ctl -D ./ -o "-F -k \"/tmp\"" -w start -l ./log
waiting for server to start.... done
server started
[grahamc@kif:/hydra/scratch/haumea-hack/target]$ cat log
2021-03-09 02:15:32.961 UTC [9353] LOG: starting PostgreSQL 12.6 on x86_64-pc-linux-gnu, compiled by gcc (GCC) 9.3.0, 64-bit
2021-03-09 02:15:32.961 UTC [9353] LOG: listening on IPv6 address "::1", port 5432
2021-03-09 02:15:32.961 UTC [9353] LOG: listening on IPv4 address "127.0.0.1", port 5432
2021-03-09 02:15:32.962 UTC [9353] LOG: listening on Unix socket "/tmp/.s.PGSQL.5432"
2021-03-09 02:15:32.969 UTC [9354] LOG: database system was shut down at 2021-03-09 02:10:25 UTC
2021-03-09 02:15:32.971 UTC [9353] LOG: database system is ready to accept connections
I'm also interested in monitoring Hydra instances using a Prometheus exporter; however, I'm rather hesitant to simply include delft/prometheus/hydra-queue-runner-reexporter.py from here, since it has no releases and may change suddenly (breaking changes don't get a new major version here).
I'd suggest creating a new repository (such as nixos/hydra-prometheus-exporter) and maintaining the tool there.
If we do this, I'd also help maintain the repo and package.
A number of people asked me when we were updating the images to 19.09. Opening this issue to track it.
Oct 20 13:18:21 ceres hydra-queue-runner[10314]: possibly transient failure building ‘/nix/store/vlpwymfgjw6ankg906aqkjfsb30yk41a-nixpkgs-metrics.drv’ on ‘[email protected]’: error: --- Error --- hydra-queue-runner
Oct 20 13:18:21 ceres hydra-queue-runner[10314]: cannot connect to ‘[email protected]’: ssh: connect to host t2a.cunat.cz port 22: No route to host
This causes a few seconds of delay for every SSH connection, since hydra.nixos.org tries to connect via IPv6 first, e.g.
$ ssh -i /var/lib/hydra/queue-runner/.ssh/id_buildfarm_rsa -v [email protected]
OpenSSH_7.9p1, OpenSSL 1.0.2r 26 Feb 2019
debug1: Reading configuration data /root/.ssh/config
debug1: Reading configuration data /etc/ssh/ssh_config
debug1: /etc/ssh/ssh_config line 84: Applying options for *
debug1: auto-mux: Trying existing master
debug1: Control socket "/root/.ssh/control-d2410326.packethost.net-22-root" does not exist
debug1: Connecting to d2410326.packethost.net [2604:1380:2000:be00::1] port 22.
(...delay...)
debug1: connect to address 2604:1380:2000:be00::1 port 22: No route to host
debug1: Connecting to d2410326.packethost.net [147.75.100.189] port 22.
debug1: Connection established.
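Until those hosts get working IPv6, one possible mitigation on the queue runner is to force IPv4 for them, e.g. via the stock NixOS ssh-client option (the host pattern is a guess):

```nix
{
  programs.ssh.extraConfig = ''
    Host *.packethost.net
      # Skip the IPv6 connection attempt that times out with
      # "No route to host" before falling back to IPv4.
      AddressFamily inet
  '';
}
```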
Currently https://gist.github.com/grahamc/df1bb806eb3552650d03eef7036a72ba is linked from https://cache.nixos.org/ as a diagnostics script for cache/CDN issues, but it's still geared towards CloudFront instead of Fastly.
I couldn't find a similar script provided by Fastly directly, so I'm not sure whether any other domains or hosts besides cache.nixos.org need checking; thus I'm not proposing a concrete change here.
Hydra is down. Looks to be database related.
DBIx::Class::Storage::DBI::catch {...} (): DBI Connection failed: DBI connect('dbname=hydra;host=10.254.1.9;user=hydra;','',...) failed: could not connect to server: Connection refused
Is the server running on host "10.254.1.9" and accepting
TCP/IP connections on port 5432? at /nix/store/rbx4maml998p56s3ilfwr6xz82bgd3vy-hydra-perl-deps/lib/perl5/site_perl/5.32.0/DBIx/Class/Storage/DBI.pm line 1517. at /nix/store/0k21rc23irjqhn6y8ahjqbb998jdgakh-hydra-0.1.20210202.bc12fe1/libexec/hydra/lib/Hydra/Helper/CatalystUtils.pm line 420
I used to be able to stop and restart jobs on hydra.nixos.org. I also used to be able to move builds to the front of the queue. A couple of weeks ago, however, I lost those privileges. Is there any particular reason why those abilities were removed from my user?
The bigmac-guest machine keeps eating builds in this style:
unpacking sources
unpacking source archive /nix/store/xklry47z0gxb3gc4697hjzj5imnadvcv-rust-1.30.1-x86_64-apple-darwin.tar.gz
tar: Skipping to next header
gzip: stdin: unexpected end of file
tar: Child returned status 1
tar: Error is not recoverable: exiting now
do not know how to unpack source archive /nix/store/xklry47z0gxb3gc4697hjzj5imnadvcv-rust-1.30.1-x86_64-apple-darwin.tar.gz
builder for '/nix/store/560apllzcpnglhsrb168v3lv2774wbmb-rustc-bootstrap-1.30.1.drv' failed with exit code 1
I've tried some of the archives and they're fine. Some other machines completed some of the jobs fine when restarted. /cc @grahamc @copumpkin
I'm not sure about the right people to ping, or even whether this is the best channel to open this thread.