nixos / infra
NixOS configurations for nixos.org and its servers
License: MIT License
Right now there are a few groups of managed build nodes. This makes it difficult to ensure they're all running consistent versions of Nix and NixOS, and are participating in the monitoring.
Import the x86 Linux machines into the network:
Import the macOS-on-Linux machines into the network:
Convert the remaining macOS-on-Darwin machines to macOS-on-Linux and import them into the network:
Import the aarch64-linux machines into the network:
I noticed that LLVM wasn't available in the binary cache, but built fine on my machine. It looks like the build machine is running out of memory:
g++: fatal error: Killed signal terminated program cc1plus
I noticed this because Anki fails to build in the nixpkgs-20.03-darwin jobset with OSError: [Errno 24] Too many open files.
The same derivation can build locally with a higher limit. There's a sample LaunchDaemon to configure this here: https://eradman.com/entrproject/limits.html
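Before baking a limit into a LaunchDaemon, the effective limit is easy to inspect and adjust from a shell; a quick sketch (the 1024 value is illustrative, not a recommendation):

```shell
# Inspect the current limits on open file descriptors (values vary by
# platform; the stock macOS soft limit is famously low, often 256).
ulimit -Sn   # soft limit
ulimit -Hn   # hard limit

# Set the soft limit for this shell and its children; anything up to the
# hard limit is allowed without root:
ulimit -Sn 1024
ulimit -Sn
```

The soft limit set this way only applies to the current shell and its children, which is why a persistent LaunchDaemon-level setting is needed for the build services.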
I'm not sure what the consequences of this are for virtualization (e.g. does the guest use the host's file limit?).
cc @grahamc because you wrote the README for the macOS infrastructure 🙏
I was trying to figure out why nixpkgs-unstable wasn't updating for a few days, and stumbled upon this page: https://hydra.nixos.org/build/139583183
It's filled with a bunch of messages like:
--- Error --- hydra-queue-runner: cannot connect to '[email protected]':
@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
@ WARNING: POSSIBLE DNS SPOOFING DETECTED! @
@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
The ED25519 host key for c5517495.packethost.net has changed, and the key for the corresponding IP address 2604:1380:2001:2000::d is unknown. This could either mean that DNS SPOOFING is happening or the IP address for the host and its host key have changed at the same time.
@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
@ WARNING: REMOTE HOST IDENTIFICATION HAS CHANGED! @
@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
IT IS POSSIBLE THAT SOMEONE IS DOING SOMETHING NASTY! Someone could be eavesdropping on you right now (man-in-the-middle attack)! It is also possible that a host key has just been changed. The fingerprint for the ED25519 key sent by the remote host is SHA256:MEs3I20z6zLDCoRXxDwb41ivxoR+o1a+O5HHE6t6dmc. Please contact your system administrator. Add correct host key in /tmp/nix-29595-575205/host-key to get rid of this message. Offending ED25519 key in /tmp/nix-29595-575205/host-key:1
ED25519 host key for c5517495.packethost.net has changed and you have requested strict checking. Host key verification failed.
It also has some successful builds, so maybe it's a transient issue.
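If the key change turns out to be legitimate (e.g. the machine was reprovisioned), the stale entry just needs to be removed from the known-hosts file. A self-contained sketch using a throwaway key and a scratch file, since the /tmp paths in the log above are transient:

```shell
# Build a scratch known_hosts with one entry, then remove it the way the
# offending entry above would be removed (key and paths here are made up).
kh=$(mktemp -d)
ssh-keygen -t ed25519 -N '' -f "$kh/id" >/dev/null        # throwaway key pair
printf 'c5517495.packethost.net %s\n' "$(cut -d' ' -f1-2 "$kh/id.pub")" > "$kh/known_hosts"

# -R removes every entry matching the hostname and backs up the old file:
ssh-keygen -R c5517495.packethost.net -f "$kh/known_hosts"
grep -q packethost "$kh/known_hosts" || echo "entry removed"
```

For the queue runner the relevant file would be whatever Nix passes as the host-key file for that builder, not the user's ~/.ssh/known_hosts.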
Creating this issue based on this Discourse thread.
Ike has been failing builds transiently a few times now, including in the middle of a webkit build that didn't use network access, so the issue appears to be RAM- or disk-related.
And this doesn't sound good: if the error occurs in e.g. the final build product, the compilation step may actually pass and generate a broken binary.
In addition, these failing builds slow down the channel, as they need to be manually restarted for the channel to move forward.
I understand that ike is currently one of the most powerful machines in Hydra, but maybe it'd make sense to disable it and see whether the channel actually bumps faster (as there should be fewer transient failures)?
This breaks the nix log command. Example:
$ curl https://cache.nixos.org/7k3vlmpvzjqrc7lbmz1csbhg5d1rn4fw.narinfo
StorePath: /nix/store/7k3vlmpvzjqrc7lbmz1csbhg5d1rn4fw-glibc-2.31
URL: nar/1236ysxagw0gv01cw4i4dsbm257lzpq8mxkpjzmi6pg7p0nqb0ac.nar.xz
Compression: xz
FileHash: sha256:1236ysxagw0gv01cw4i4dsbm257lzpq8mxkpjzmi6pg7p0nqb0ac
FileSize: 6401092
NarHash: sha256:12k2hciiibf4x68957x7234kbykw0szpq9gidn10qvpxckm7rm0l
NarSize: 30519536
References: 7k3vlmpvzjqrc7lbmz1csbhg5d1rn4fw-glibc-2.31 z1sxk8d5z9cn89pv46h800lkqjl22g67-libidn2-2.3.0
Sig: cache.nixos.org-1:+Vr1hG9RWRRZrBlI2u/ZPA7sfMkF2PISLyiP/sa+zMrjm1TjxDVqkCrh48nWLTipYISPg3QPL5LjL23xi9kNAQ==
whereas previously:
$ curl https://cache.nixos.org/5ka41zhii1bjss3f60rzd2npz9mxj060.narinfo
StorePath: /nix/store/5ka41zhii1bjss3f60rzd2npz9mxj060-glibc-2.27
URL: nar/0rjjy0l8a5z7ajk4zprvnwqbmcgaqlnpsd2v80kdb3frlw9gkgh8.nar.xz
Compression: xz
FileHash: sha256:0rjjy0l8a5z7ajk4zprvnwqbmcgaqlnpsd2v80kdb3frlw9gkgh8
FileSize: 6152544
NarHash: sha256:1wj2wa8hkb792q9qp6rck87315jldixjp7kaw1rz0r7sh09kwblx
NarSize: 27158944
References: 5ka41zhii1bjss3f60rzd2npz9mxj060-glibc-2.27
Deriver: nq8gcjq461ijxj2s28xjikyjan6xyq90-glibc-2.27.drv
Sig: cache.nixos.org-1:1J0Aafo5T+O/W032Tz1t/ql7UO1fOCu3PWHpwS6KGATWN7nmn9DJXo9esu1ir7Nnfii7cuZ1xpZblQ0IElhdDg==
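The visible difference is that the new narinfo lacks the Deriver field, which nix log uses to locate the build log. A self-contained sketch of checking for it (normally you'd pipe in `curl -s <narinfo URL>`; the inlined sample mimics the new, Deriver-less response):

```shell
# Check whether a .narinfo carries a Deriver field. Sample text is inlined
# so the snippet runs without network access.
narinfo='StorePath: /nix/store/7k3vlmpvzjqrc7lbmz1csbhg5d1rn4fw-glibc-2.31
Compression: xz
NarSize: 30519536'

if printf '%s\n' "$narinfo" | grep -q '^Deriver:'; then
  echo "has Deriver"
else
  echo "no Deriver"   # this sample, like the new cache responses, has none
fi
```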
@rbvermaa are you handling this service? http://nixos.org/irc/logs/
The IRC logs no longer work; the last one was in July.
The user mog on #nixos has logs for that time period.
Apparently for some machines the scheduler thinks they have the "big-parallel" feature but the builder itself disagrees. We're then getting aborted builds like:
Aborted: error: --- Error --- nix-store: a 'aarch64-linux' with features {big-parallel} is required to build '/nix/store/kjvs406gd1gxa90fz614krnasv3p8wym-llvm-9.0.1.drv', but I am a 'aarch64-linux' with features {kvm, nixos-test, recursive-nix}
The current case was aarch64, but I may have seen an x86_64 one a couple of days ago.
Here is a rough sorting of domains to drop or keep, based on which ones just redirected back to nixos.org, which I'd never heard of before, or which were showing a "website maintenance" page, plus a couple I have heard of but nobody uses:
Drop:
ts.nixos.org
svn.nixos.org
status.nixos.org * never updated since I came around ...
stan.nixos.org
releases.nixos.org * I wonder how much traffic this gets
mturk.nixos.org
monitor.nixos.org
lucifer.nixos.org
losser.nixos.org
barbrady.nixos.org
Keep:
cache.nixos.org
hydra.nixos.org
weekly.nixos.org
conf.nixos.org
tarballs.nixos.org
planet.nixos.org
Domain list from #33 (comment)
This will allow us to get rid of the EC2 web server.
Despite services.nginx.virtualHosts."monitoring.nixos.org".enableACME being set, the certificate expired yesterday (2021-03-18). This has a knock-on effect: status.nixos.org no longer works (see also NixOS/nixos-status#9).
I'm sure @zimbatm would like to handle this one :)
Dec 20 03:22:24 webserver systemd[1]: Started Update Channel nixos-18.09.
Dec 20 03:22:29 webserver update-nixos-18.09-start[9256]: release is ‘nixos-18.09.1761.9bacb8289bb’ (build 86094175), eval is 1496550, prefix is nixos/18.09/nixos-18.09.1761.9bacb8289bb, Git commit is 9bacb8289bbd401988d94aacea83efbe225ebc1a
Dec 20 03:22:29 webserver update-nixos-18.09-start[9256]: Net::Amazon::S3: Amazon responded with 403 Forbidden
Dec 20 03:22:29 webserver update-nixos-18.09-start[9256]: at /nix/store/5ajpkdh6byxjfayn97q2xksdii71w5m6-perl-Net-Amazon-S3-0.80/lib/perl5/site_perl/5.24.3/Net/Amazon/S3/Bucket.pm line 151.
Dec 20 03:22:29 webserver systemd[1]: update-nixos-18.09.service: Main process exited, code=exited, status=255/n/a
Dec 20 03:22:29 webserver systemd[1]: update-nixos-18.09.service: Unit entered failed state.
Dec 20 03:22:29 webserver systemd[1]: update-nixos-18.09.service: Failed with result 'exit-code'.
# systemctl status mirror-tarballs | cat
● mirror-tarballs.service - Mirror Nixpkgs Tarballs
Loaded: loaded (/nix/store/184y8myyyw134ri16d844ccwm17wxfkg-unit-mirror-tarballs.service/mirror-tarballs.service; linked; vendor preset: enabled)
Active: failed (Result: exit-code) since Tue 2019-01-22 05:31:17 CET; 5h 42min ago
Process: 2448 ExecStart=/nix/store/nrv6r883198m3s9cf49v4xjl5haclcwc-unit-script-mirror-tarballs-start (code=exited, status=1/FAILURE)
Main PID: 2448 (code=exited, status=1/FAILURE)
Jan 22 05:31:13 bastion nrv6r883198m3s9cf49v4xjl5haclcwc-unit-script-mirror-tarballs-start[2448]: trace: stdenv.isArm is deprecated after 18.03
Jan 22 05:31:13 bastion nrv6r883198m3s9cf49v4xjl5haclcwc-unit-script-mirror-tarballs-start[2448]: trace: stdenv.isArm is deprecated after 18.03
Jan 22 05:31:17 bastion nrv6r883198m3s9cf49v4xjl5haclcwc-unit-script-mirror-tarballs-start[2448]: GC Warning: Failed to expand heap by 8388608 bytes
Jan 22 05:31:17 bastion nrv6r883198m3s9cf49v4xjl5haclcwc-unit-script-mirror-tarballs-start[2448]: GC Warning: Failed to expand heap by 65536 bytes
Jan 22 05:31:17 bastion nrv6r883198m3s9cf49v4xjl5haclcwc-unit-script-mirror-tarballs-start[2448]: GC Warning: Out of Memory! Heap size: 3690 MiB. Returning NULL!
Jan 22 05:31:17 bastion nrv6r883198m3s9cf49v4xjl5haclcwc-unit-script-mirror-tarballs-start[2448]: error: out of memory
Jan 22 05:31:17 bastion nrv6r883198m3s9cf49v4xjl5haclcwc-unit-script-mirror-tarballs-start[2448]: ./maintainers/scripts/copy-tarballs.pl: evaluation failed
Jan 22 05:31:17 bastion systemd[1]: mirror-tarballs.service: Main process exited, code=exited, status=1/FAILURE
Jan 22 05:31:17 bastion systemd[1]: mirror-tarballs.service: Failed with result 'exit-code'.
Jan 22 05:31:17 bastion systemd[1]: Failed to start Mirror Nixpkgs Tarballs.
In the event that all our infrastructure goes down, we want to be notified as soon as possible.
It could be that our domain has expired.
It could be that one of our partners decided to pull the plug.
It could be that we got hacked really bad.
In any case, it would be nice to receive an SMS if that happened.
It would be easier to figure out and fix issues like nixos-homepage#232 and #41 with log access.
Monitoring the website build, even just for a simple "hey, look, I failed", would at least let us act more quickly on figuring out a solution. A month is kinda slow :/.
(I have no actual knowledge or propositions for this.)
Instead, let's consider running them sequentially. The memory pressure on the instance is too high for it to get two done at once.
An issue to centralize the problems some users are having with cache.nixos.org after the switch to Fastly.
Oct 20 13:17:45 ceres hydra-queue-runner[10314]: possibly transient failure building ‘/nix/store/gyq9h07h6iph8fdy9pa873w720zwkcm0-ghc-8.11.20200824.drv’ on ‘root@mac1-guest’: error: --- Error --- hydra-queue-runner
Oct 20 13:17:45 ceres hydra-queue-runner[10314]: cannot connect to ‘root@mac1-guest’: [email protected]: Permission denied (publickey).
Oct 20 13:18:47 ceres hydra-queue-runner[10314]: possibly transient failure building ‘/nix/store/gyq9h07h6iph8fdy9pa873w720zwkcm0-ghc-8.11.20200824.drv’ on ‘root@mac9-guest’: error: --- Error --- hydra-queue-runner
Oct 20 13:18:47 ceres hydra-queue-runner[10314]: cannot connect to ‘root@mac9-guest’: kex_exchange_identification: Connection closed by remote host
Nix config is missing for EC2-provisioned machines; I don't see that they include delft/common.nix, so settings like build-cores = 0 aren't applied. This is why chromium takes 8h to compile when in reality it should take 40-60 min (plus the load).
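For illustration, a minimal fragment of what a shared config could set, using the stock NixOS options (values are illustrative, not taken from the real delft/common.nix):

```nix
{
  # 0 means "use all available cores" for each individual build.
  nix.buildCores = 0;
  # How many builds may run concurrently; tune per EC2 instance size.
  nix.maxJobs = 4;
}
```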
When I type nixos.org in the browser URL bar, it doesn't redirect to https://nixos.org. I think this should be done at least for the http://nixos.org/nixos/security.html page.
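Assuming the site is served by the NixOS nginx module, the redirect is a one-line option; a sketch (the real vhost definition may differ):

```nix
{
  services.nginx.virtualHosts."nixos.org" = {
    enableACME = true;
    # forceSSL emits a 301 redirect for every http:// request to https://.
    forceSSL = true;
  };
}
```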
Now that 20.03 is out (congrats!), it would be great to have official AMIs out there as well, otherwise this page looks broken:
https://nixos.org/download.html (tab Amazon EC2).
In this discussion on Discourse @grahamc mentions that maybe the release documentation needs updating to include building AMIs as part of the release.
See NixOS/nix#75 (comment): Hydra already signs packages, so the cache would need to copy the signatures and create signatures for already-built packages.
According to this document https://aws.amazon.com/blogs/aws/aws-ipv6-update-global-support-spanning-15-regions-multiple-aws-services/ there is IPv6 support in eu-west-1, where the website is apparently hosted. Once the homepage has IPv6 support as well, IPv6 support will be complete.
$ dig AAAA nixos.org
; <<>> DiG 9.10.4-P6 <<>> AAAA nixos.org
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 27168
;; flags: qr rd ra; QUERY: 1, ANSWER: 0, AUTHORITY: 0, ADDITIONAL: 1
;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 4096
;; QUESTION SECTION:
;nixos.org. IN AAAA
;; Query time: 0 msec
;; SERVER: 127.0.0.1#53(127.0.0.1)
;; WHEN: Tue Apr 25 10:03:41 CEST 2017
;; MSG SIZE rcvd: 38
Currently the certificate for cache.nixos.org is not recognized by some browsers and is provisioned manually, so it would be better to use ACM.
To make it as easy as possible for users to connect to the community, it would be great if nixos.org served a web interface, with history, for the IRC channel, so people can use IRC without a client.
A good candidate seems to be https://github.com/ircanywhere/ircanywhere.
The region is in the AMI upload script but not in the AMI list https://github.com/NixOS/nixpkgs/blob/master/nixos/modules/virtualisation/ec2-amis.nix.
I think that's achieved by flipping metric.current in Prometheus: https://github.com/NixOS/nixos-org-configurations/blob/3294705/delft/eris/status-page/status.js#L120
Currently the page displays "loading from Prometheus" forever, with the Prometheus API redirecting to 0.0.0.0.
(@grahamc is already aware, just filing this as an issue as well)
It seems that something along the path is dropping ICMP "Packet Too Big" messages, breaking path MTU discovery:
$ ping -6 nixos.org -s 1444
PING nixos.org(2a05:d014:275:cb00:ec0d:12e2:df27:aa60 (2a05:d014:275:cb00:ec0d:12e2:df27:aa60)) 1444 data bytes
1452 bytes from 2a05:d014:275:cb00:ec0d:12e2:df27:aa60 (2a05:d014:275:cb00:ec0d:12e2:df27:aa60): icmp_seq=1 ttl=48 time=16.1 ms
^C
$ ping -6 nixos.org -s 1445
PING nixos.org(2a05:d014:275:cb00:ec0d:12e2:df27:aa60 (2a05:d014:275:cb00:ec0d:12e2:df27:aa60)) 1445 data bytes
^C
Here's a tracepath:
1?: [LOCALHOST] 0.005ms pmtu 1500
1: redacted.dip.versatel-1u1.de 0.660ms
1: redacted.dip.versatel-1u1.de 0.437ms
2: redacted.dip.versatel-1u1.de 0.379ms pmtu 1492
2: redacted 1.384ms
3: redacted 8.840ms
4: 2001:1438:0:1::5:1f2 9.661ms
5: 2001:1438:0:1::5:72 14.331ms
6: fra1-edge1.digitalocean.com 15.400ms
7: 2604:a880:ffff:5::43a 35.559ms
8: no reply
9: no reply
10: no reply
11: no reply
Here's hydra.nixos.org for comparison:
1?: [LOCALHOST] 0.013ms pmtu 1500
1: redacted.dip.versatel-1u1.de 0.486ms
1: redacted.dip.versatel-1u1.de 2.604ms
2: redacted.dip.versatel-1u1.de 0.371ms pmtu 1492
2: redacted 1.084ms
3: redacted 14.956ms
4: 2001:1438:0:1::5:1f2 7.794ms
5: 2001:1438:0:1::5:72 14.621ms
6: 2a01:4f8:0:e0f0::29 14.935ms
7: core1.fra.hetzner.com 20.925ms
8: core23.fsn1.hetzner.com 20.273ms
9: ex9k1.dc5.fsn1.hetzner.com 19.310ms
10: 2a01:4f8:140:244c:: 19.098ms reached
Resume: pmtu 1492 hops 10 back 10
To do this, we should:
and then, in order:
As recently discussed with @vcunat and others on IRC, I think the Hydra email notifications are very valuable and should be re-enabled. They were disabled in March after some spam issues. However, without them it's hard for a maintainer to actually maintain a package (e.g. react to failures).
IRC logs:
12:13 <timokau[m]> Whats the status of hydra emailing maintainers? Was that just never re-activated after the spam? Or was the issue never fixed?
12:23 <vcunat> Never reactivated AFAIK.
12:24 <ekleog> oh :/
12:25 <timokau[m]> Can we just do that, or is there some work involved?
12:27 <vcunat> Here's the line
12:27 <vcunat> https://github.com/NixOS/nixos-org-configurations/blame/1dfde8a7cc461cacf338250a8a2eb53b0e0bd72c/delft/chef.nix#L54
12:27 <vcunat> I don't know how Hydra's mail works on the inside. (e.g. if it will try to re-send those mails or something)
12:27 <vcunat> niksnut: ^^
12:39 <niksnut> yeah it's disabled
12:41 <vcunat> and expected not to cause trouble if simply re-enabled?
12:42 <vcunat> My guess would be that the problem was that the *first* evaluation happenned with the feature on. On subsequent evaluations I'd expect only status changes would be e-mailed, but I might easily be wrong.
12:48 <niksnut> IMHO email notification is not really worth it
12:48 <niksnut> it causes more problems than it's worth, and most users don't care for it
13:10 <LnL> can people without an account access the maintainers page on hydra?
13:11 <LnL> doesn't look like it https://hydra.nixos.org/dashboard/[email protected]#tabs-my-jobs
13:13 <LnL> also not everything is in there
13:16 <LnL> oh, meta.maintainers is broken on hyra Maintainer(s):not given
13:16 <LnL> niksnut: ^
13:36 <timokau[m]> vcunat: niksnut: I think email notifications are very much worth it. How else are maintainers supposed to notice that their packages break? I think it is a very important step in minimizing hydra failures.
13:40 <vcunat> timokau[m]: yes, I don't know a better way ATM.
13:40 <vcunat> Most maintainers didn't react to the e-mails apparently, but if working reasonably reliably, the feature would seem a nice to have.
13:44 <timokau[m]> Yes and it was working reliably until the spam. If that is still a concern, maybe some stupid rate limiting would reduce the risk. Or worst case we could at least make it possible to opt-in.
13:44 <vcunat> I occasionally did get some weird messages for it for builds that were months old.
13:44 <vcunat> s/for it/from it/
13:47 <aminechikhaoui> vcunat: I saw that also in our private hydra, I think it has to do with the attempted fix here but not sure https://github.com/NixOS/hydra/pull/566
13:48 <aminechikhaoui> but it basically happens every time we restart the queue runner
13:48 <vcunat> well first we need to fix filling the maintainer colon, as without that data there won't be anyone to send to
13:49 <vcunat> (except for those messages: "your commit may have broken this build")
13:50 <vcunat> Eh, not "colon", but I guess you know what I mean :-)
13:51 <timokau[m]> In my opinion a few false-positives would be better than no positives at all :)
13:52 <timokau[m]> I didn't know there was also that kind of message. Aren't that usually a lot of commits?
14:14 <vcunat> It certainly happened commonly that there were many.
14:14 <vcunat> > This may be due to 640 commits by ... (long list of authors)
14:19 <timokau[m]> Those messages should probably be disabled. Or even better only sent if up to X commits might be responsible.
14:32 <timokau[m]> And the problem with maintainers is that the parsing was just never adapted to the new maintainers format?
14:54 <vcunat> It's possible. I don't know if anything was attempted.
We have a bunch of packages that are failing to build due to missing SSE 4.2 instruction set on wendy.
There's an issue NixOS/nixpkgs#115425 that's waiting on a Nix release, but that's going to take a while to be usable.
Since wendy is supposed to be retired later this year anyway, and given that it doesn't play a huge role in the Linux workload, I'd suggest retiring it now.
The alternative is to make it run only jobs with the big-parallel and tests features.
Note that this is currently a big blocker for data science packages, and it causes frustration as packages break all the time.
I'm happy to do the work myself if I can help in any way.
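If the feature-restriction alternative is preferred, here is a sketch of the relevant build-machine entry using the standard nix.buildMachines options (the numbers are illustrative, not taken from the real config):

```nix
{
  nix.buildMachines = [{
    hostName = "wendy";
    system = "x86_64-linux";
    maxJobs = 8;   # illustrative
    supportedFeatures = [ "big-parallel" "nixos-test" ];
    # To keep ordinary jobs off the machine entirely, mandatoryFeatures can
    # also be set; a job is only dispatched here if it requests all of them.
  }];
}
```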
After #68, this is the new list of name servers to switch to at united domains:
ns-1455.awsdns-53.org
ns-1875.awsdns-42.co.uk
ns-483.awsdns-60.com
ns-998.awsdns-60.net
There is the DNS server config, and there should be a separate place where just the nameservers can be switched away from the registrar-hosted DNS.
From the hydra.nixos.org logs:
Jul 28 13:09:44 ceres hydra-queue-runner[9427]: possibly transient failure building ‘/nix/store/4k1353pypg9jzq0nsmchq7w3rl0s2bg9-nixpkgs-metrics.drv’ on ‘[email protected]’: error: --- Error --- hydra-queue-runner
Jul 28 13:09:44 ceres hydra-queue-runner[9427]: cannot connect to ‘[email protected]’: ssh: connect to host t2a.cunat.cz port 22: No route to host
@vcunat Do you know what's up with this machine?
Those are managed by Terraform now.
We have started to use "Repology" data in Nixpkgs through nix-update. Repology takes the .json file provided by nixos.org and finds outdated packages. It updates hourly.
This works well, but the data is frequently out of date:
$ curl -I https://nixos.org/nixpkgs/packages-unstable.json.gz
HTTP/1.1 200 OK
Date: Sun, 25 Mar 2018 03:22:34 GMT
Server: Apache/2.4.29 (Unix) OpenSSL/1.0.2n
Strict-Transport-Security: max-age=15552000
Last-Modified: Wed, 07 Mar 2018 16:00:44 GMT
ETag: "13c573-566d4a9650dba"
Accept-Ranges: bytes
Content-Length: 1295731
Content-Type: application/json
Content-Encoding: x-gzip
Can we set this to update more frequently? It's currently more than 2 weeks old, giving us bad data from Repology.
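The .json generation appears to run as a periodic job; as a sketch, pinning it to an hourly schedule with a NixOS systemd timer would look roughly like this (the unit name and script are made up, not the real ones):

```nix
{
  systemd.services.generate-packages-json = {
    serviceConfig.Type = "oneshot";
    # Placeholder; the real job runs whatever produces packages-unstable.json.
    script = "echo 'regenerate packages-unstable.json here'";
  };
  systemd.timers.generate-packages-json = {
    wantedBy = [ "timers.target" ];
    timerConfig.OnCalendar = "hourly";   # instead of every couple of weeks
  };
}
```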
Channel releases happen via systemd timer units.
It would be good to move that task away from the nixos.org webserver, just in case nginx has an exploitable security hole.
Another reason is that channel releases can OOM, which could kill the webserver unnecessarily. If a channel release fails, it's retried anyway.
As introduced in nix-community/nixops-gce#1, we can switch from creating our own bootstrap image per deployment to having a public image available in the NixOS GCP account, similar to the AMIs in EC2.
We then need to update the https://github.com/NixOS/nixpkgs/blob/master/nixos/modules/virtualisation/gce-images.nix file, specifying the image family and the project name.
Here are the steps:
Building an image from source
$ gcloud compute images create nixos-18091228a4c4cbb613c-x86-64-linux \
--source-uri gs://nixos-cloud-images/nixos-image-18.09.1228.a4c4cbb613c-x86_64-linux.raw.tar.gz \
--family=nixos-1809
Making the image public
$ gcloud compute images add-iam-policy-binding nixos-18091228a4c4cbb613c-x86-64-linux \
--member='allAuthenticatedUsers' \
--role='roles/compute.imageUser'
That image may then be used publicly, so nixops users won't need to provision their own 'bootstrap-image' resource for every deployment:
$ gcloud compute instances create test-nixos-18 \
--image-family=nixos-1809 \
--zone=europe-west1-c \
--image-project=predictix-operations
Hi,
I wanted to download an image for my rpi3 from Hydra and noticed that Hydra seems to be having some issues.
According to the Grafana board, all services are failing.
# journalctl -u update-nixos-20.09-small.service
....
Oct 05 11:40:16 bastion update-nixos-20.09-small-start[18899]: $ index-debuginfo /scratch/hydra-mirror/nixos-files.sqlite s3://nix-cache /scratch/hydra-mirror/release-nixos-20.09-small/nixos-20.09beta977.ad3a5d5092e/store-paths
Oct 05 11:40:16 bastion update-nixos-20.09-small-start[27673]: error: --- Error --- index-debuginfo
Oct 05 11:40:16 bastion update-nixos-20.09-small-start[27673]: don't know how to open Nix store 's3://nix-cache'
Oct 05 11:40:16 bastion update-nixos-20.09-small-start[18899]: Command failed with code (1) errno (0).
Tomorrow at 14:00 America/New_York we'll be migrating Hydra's database from PostgreSQL 11 to PostgreSQL 12.
Motivation: let us use log_transaction_sample_rate and other improvements in PostgreSQL 12. Once we upgrade to 21.05, I'll want to get us to 13 soon after, to let us use sample-based slow query logging.
How to:
systemctl stop postgresql.service
zfs snapshot rpool/safe/postgres@postgres-11-to-12-migration-pre
diff --git a/delft/haumea.nix b/delft/haumea.nix
index b94676f..ece5b8d 100644
--- a/delft/haumea.nix
+++ b/delft/haumea.nix
@@ -84,7 +84,7 @@
services.postgresql = {
enable = true;
- package = pkgs.postgresql_11;
+ package = pkgs.postgresql_12;
dataDir = "/var/db/postgresql";
# https://pgtune.leopard.in.ua/#/
settings = {
nixops deploy -d buildfarm --include haumea --dry-activate
oldpg=$(nix-build -I nixpkgs=channel:nixos-20.09-small -E '(import <nixpkgs> {}).postgresql_11')
newpg=$(nix-build -I nixpkgs=channel:nixos-20.09-small -E '(import <nixpkgs> {}).postgresql_12')
cd /var/db/postgresql
mkdir old
chmod 0700 old
mv ./* old || true
mkdir new
"$newpg/bin/initdb" -U root --locale=en_US.UTF-8 --encoding UTF8 ./new
"${newpg}/bin/pg_upgrade" \
--old-bindir="${oldpg}/bin/" \
--new-bindir="${newpg}/bin/" \
--old-datadir "./old" \
--new-datadir "./new" \
--link \
--user root \
--verbose
rm -rf old
mv ./new/* .
zfs snapshot rpool/safe/postgres@postgres-11-to-12-migration-post
nixops deploy -d buildfarm --include haumea
Postgresql should be started now.
The --link option uses hardlinks to copy the data files, and only system tables are rewritten during the migration.
To roll back: zfs rollback rpool/safe/postgres@postgres-11-to-12-migration-pre
Here is the result of running these steps on a clone of production data:
[grahamc@kif:/hydra/scratch/haumea-hack/target]$ oldpg=$(nix-build -I nixpkgs=channel:nixos-20.09-small -E '(import <nixpkgs> {}).postgresql_11')
[grahamc@kif:/hydra/scratch/haumea-hack/target]$ newpg=$(nix-build -I nixpkgs=channel:nixos-20.09-small -E '(import <nixpkgs> {}).postgresql_12')
[grahamc@kif:/hydra/scratch/haumea-hack/target]$ mkdir old
[grahamc@kif:/hydra/scratch/haumea-hack/target]$ chmod 0700 old
[grahamc@kif:/hydra/scratch/haumea-hack/target]$ mv ./* old || true
mv: cannot move './old' to a subdirectory of itself, 'old/old'
[grahamc@kif:/hydra/scratch/haumea-hack/target]$
[grahamc@kif:/hydra/scratch/haumea-hack/target]$ mkdir new
[grahamc@kif:/hydra/scratch/haumea-hack/target]$ "$newpg/bin/initdb" -U root ./new
The files belonging to this database system will be owned by user "grahamc".
This user must also own the server process.
The database cluster will be initialized with locale "en_US.UTF-8".
The default database encoding has accordingly been set to "UTF8".
The default text search configuration will be set to "english".
Data page checksums are disabled.
fixing permissions on existing directory ./new ... ok
creating subdirectories ... ok
selecting dynamic shared memory implementation ... posix
selecting default max_connections ... 100
selecting default shared_buffers ... 128MB
selecting default time zone ... UTC
creating configuration files ... ok
running bootstrap script ... ok
performing post-bootstrap initialization ... ok
syncing data to disk ... ok
initdb: warning: enabling "trust" authentication for local connections
You can change this by editing pg_hba.conf or using the option -A, or
--auth-local and --auth-host, the next time you run initdb.
Success. You can now start the database server using:
/nix/store/140ag1560jjjqli2daz0d7cwwxbsa4ra-postgresql-12.6/bin/pg_ctl -D ./new -l logfile start
[grahamc@kif:/hydra/scratch/haumea-hack/target]$ "${newpg}/bin/pg_upgrade" \
> --old-bindir="${oldpg}/bin/" \
> --new-bindir="${newpg}/bin/" \
> --old-datadir "./old" \
> --new-datadir "./new" \
> --link \
> --user root \
> --verbose
Running in verbose mode
Performing Consistency Checks
-----------------------------
Checking cluster versions ok
Current pg_control values:
[...]
Values to be changed:
First log segment after reset: 000000010000166F000000AD
[...]
Values to be changed:
First log segment after reset: 000000010000000000000002
[... a lot of linking and queries ...]
Upgrade Complete
----------------
Optimizer statistics are not transferred by pg_upgrade so,
once you start the new server, consider running:
./analyze_new_cluster.sh
Running this script will delete the old cluster's data files:
./delete_old_cluster.sh
[grahamc@kif:/hydra/scratch/haumea-hack/target]$ ${newpg}/bin/pg_ctl -D ./ -o "-F -k \"/tmp\"" -w start -l ./log
waiting for server to start.... done
server started
[grahamc@kif:/hydra/scratch/haumea-hack/target]$ cat log
2021-03-09 02:15:32.961 UTC [9353] LOG: starting PostgreSQL 12.6 on x86_64-pc-linux-gnu, compiled by gcc (GCC) 9.3.0, 64-bit
2021-03-09 02:15:32.961 UTC [9353] LOG: listening on IPv6 address "::1", port 5432
2021-03-09 02:15:32.961 UTC [9353] LOG: listening on IPv4 address "127.0.0.1", port 5432
2021-03-09 02:15:32.962 UTC [9353] LOG: listening on Unix socket "/tmp/.s.PGSQL.5432"
2021-03-09 02:15:32.969 UTC [9354] LOG: database system was shut down at 2021-03-09 02:10:25 UTC
2021-03-09 02:15:32.971 UTC [9353] LOG: database system is ready to accept connections
I'm also interested in monitoring Hydra instances using a Prometheus exporter; however, I'm rather hesitant to simply include delft/prometheus/hydra-queue-runner-reexporter.py from here, since it has no releases and may change suddenly (breaking changes don't get a new major version here).
I'd suggest creating a new repository (such as nixos/hydra-prometheus-exporter) and maintaining the tool there.
If we do this, I'd also help maintain the repo and package.
A number of people asked me when we were updating the images to 19.09. Opening this issue to track it.
Oct 20 13:18:21 ceres hydra-queue-runner[10314]: possibly transient failure building ‘/nix/store/vlpwymfgjw6ankg906aqkjfsb30yk41a-nixpkgs-metrics.drv’ on ‘[email protected]’: error: --- Error --- hydra-queue-runner
Oct 20 13:18:21 ceres hydra-queue-runner[10314]: cannot connect to ‘[email protected]’: ssh: connect to host t2a.cunat.cz port 22: No route to host
This causes a few seconds of delay for every SSH connection, since hydra.nixos.org tries to connect via IPv6 first, e.g.
$ ssh -i /var/lib/hydra/queue-runner/.ssh/id_buildfarm_rsa -v [email protected]
OpenSSH_7.9p1, OpenSSL 1.0.2r 26 Feb 2019
debug1: Reading configuration data /root/.ssh/config
debug1: Reading configuration data /etc/ssh/ssh_config
debug1: /etc/ssh/ssh_config line 84: Applying options for *
debug1: auto-mux: Trying existing master
debug1: Control socket "/root/.ssh/control-d2410326.packethost.net-22-root" does not exist
debug1: Connecting to d2410326.packethost.net [2604:1380:2000:be00::1] port 22.
(...delay...)
debug1: connect to address 2604:1380:2000:be00::1 port 22: No route to host
debug1: Connecting to d2410326.packethost.net [147.75.100.189] port 22.
debug1: Connection established.
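Until those hosts get working IPv6, one possible mitigation on the queue runner is to force IPv4 for them, e.g. via the stock NixOS ssh-client option (the host pattern is a guess):

```nix
{
  programs.ssh.extraConfig = ''
    Host *.packethost.net
      # Skip the IPv6 connection attempt that times out with
      # "No route to host" before falling back to IPv4.
      AddressFamily inet
  '';
}
```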
Currently https://gist.github.com/grahamc/df1bb806eb3552650d03eef7036a72ba is linked from https://cache.nixos.org/ as a diagnostics script for cache/CDN issues, but it's still geared towards CloudFront instead of Fastly.
I couldn't find a similar script provided by Fastly directly, so I'm not sure whether any other domains or hosts besides cache.nixos.org need checking; thus I'm not proposing a concrete change here.
Hydra is down. Looks to be database related.
DBIx::Class::Storage::DBI::catch {...} (): DBI Connection failed: DBI connect('dbname=hydra;host=10.254.1.9;user=hydra;','',...) failed: could not connect to server: Connection refused
Is the server running on host "10.254.1.9" and accepting
TCP/IP connections on port 5432? at /nix/store/rbx4maml998p56s3ilfwr6xz82bgd3vy-hydra-perl-deps/lib/perl5/site_perl/5.32.0/DBIx/Class/Storage/DBI.pm line 1517. at /nix/store/0k21rc23irjqhn6y8ahjqbb998jdgakh-hydra-0.1.20210202.bc12fe1/libexec/hydra/lib/Hydra/Helper/CatalystUtils.pm line 420
I used to be able to stop and restart jobs on hydra.nixos.org. I also used to be able to move builds to the front of the queue. A couple of weeks ago, however, I lost those privileges. Is there any particular reason why those abilities were removed from my user?
The bigmac-guest machine keeps eating builds in this style:
unpacking sources
unpacking source archive /nix/store/xklry47z0gxb3gc4697hjzj5imnadvcv-rust-1.30.1-x86_64-apple-darwin.tar.gz
tar: Skipping to next header
gzip: stdin: unexpected end of file
tar: Child returned status 1
tar: Error is not recoverable: exiting now
do not know how to unpack source archive /nix/store/xklry47z0gxb3gc4697hjzj5imnadvcv-rust-1.30.1-x86_64-apple-darwin.tar.gz
builder for '/nix/store/560apllzcpnglhsrb168v3lv2774wbmb-rustc-bootstrap-1.30.1.drv' failed with exit code 1
I've tried some of the archives and they're fine. Some other machines completed some of the jobs fine when restarted. /cc @grahamc @copumpkin
I'm not sure about the right people to ping, or even whether this is the best channel to open this thread.