redhat-cip / edeploy
Linux systems provisioning and updating made easy
License: Apache License 2.0
Hi,
We have two servers with the same specs (disk, memory, CPU); the only difference is the serial number. If the servers are installed at the same time, one of them raises an error about the CMDB file (no entry...).
If I boot the servers at different times, the deployment works.
In the specs file we match on the serial number (tag):
generate({'disk0': '32:0',
'disk1': '32:1',
'disk2': '32:2',
'disk3': '32:3',
'disk4': '32:4',
'disk5': '32:5',
'disk6': '32:6',
'gateway-admin': '10.154.20.3',
'hostname': 'infra01-2',
'if_b0': 'eth0',
'if_b1': 'eth1',
'if_b2': 'eth2',
'ip-admin': '10.154.20.13-14',
'netmask-admin': '255.255.255.0',
'sn': ('CXXXB2Y1', '9TZUYDEY1'),
'vlan-admin': '2320',
'vlan-infra-pub': '1302'})
The spec file:
# -*- python -*-
[
('pdisk', 'disk0', 'ctrl', '1'),
('pdisk', 'disk0', 'type', 'SAS'),
('pdisk', 'disk0', 'id', '32:0'),
('pdisk', 'disk0', 'size', '278.875'),
('pdisk', 'disk1', 'ctrl', '1'),
('pdisk', 'disk1', 'type', 'SAS'),
('pdisk', 'disk1', 'id', '32:1'),
('pdisk', 'disk1', 'size', '278.875'),
('pdisk', 'disk2', 'ctrl', '1'),
('pdisk', 'disk2', 'type', 'SAS'),
('pdisk', 'disk2', 'id', '32:2'),
('pdisk', 'disk2', 'size', '278.875'),
('pdisk', 'disk3', 'ctrl', '1'),
('pdisk', 'disk3', 'type', 'SAS'),
('pdisk', 'disk3', 'id', '32:3'),
('pdisk', 'disk3', 'size', '278.875'),
('pdisk', 'disk4', 'ctrl', '1'),
('pdisk', 'disk4', 'type', 'SAS'),
('pdisk', 'disk4', 'id', '32:4'),
('pdisk', 'disk4', 'size', '278.875'),
('pdisk', 'disk5', 'ctrl', '1'),
('pdisk', 'disk5', 'type', 'SAS'),
('pdisk', 'disk5', 'id', '32:5'),
('pdisk', 'disk5', 'size', '278.875'),
('pdisk', 'disk6', 'ctrl', '1'),
('pdisk', 'disk6', 'type', 'SAS'),
('pdisk', 'disk6', 'id', '32:6'),
('pdisk', 'disk6', 'size', '278.875'),
('system', 'product', 'serial', '$$sn'),
]
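One possible mitigation for the simultaneous-boot race is to serialize access to the CMDB file with an exclusive lock. This is only a sketch under the assumption that the server side does a read-modify-write of the CMDB file; `update_cmdb` is a hypothetical helper, not an existing eDeploy function:

```python
import fcntl

def update_cmdb(path, mutate):
    # Hypothetical helper: serialize concurrent read-modify-write cycles
    # on the CMDB file so two servers booting at once cannot race.
    with open(path, "a+") as f:
        fcntl.flock(f, fcntl.LOCK_EX)  # blocks until the other writer is done
        f.seek(0)
        new_content = mutate(f.read())
        f.seek(0)
        f.truncate()
        f.write(new_content)
        # the lock is released when the file is closed
```

With this in place, the second server's CGI request would simply wait for the first one to finish writing its entry instead of reading a half-updated file.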
+ rpm -ivh --root /var/tmp/CI/all-centos-base/install/C6.5-H.1.0.0/base http://mirror.centos.org/centos/6.5/os/x86_64/Packages/centos-release-6-5.el6.centos.11.1.x86_64.rpm
rpm: RPM should not be used directly install RPM packages, use Alien instead!
rpm: However assuming you know what you are doing...
Retrieving http://mirror.centos.org/centos/6.5/os/x86_64/Packages/centos-release-6-5.el6.centos.11.1.x86_64.rpm
error: skipping http://mirror.centos.org/centos/6.5/os/x86_64/Packages/centos-release-6-5.el6.centos.11.1.x86_64.rpm - transfer failed
Retrieving http://mirror.centos.org/centos/6.5/os/x86_64/Packages/centos-release-6-5.el6.centos.11.1.x86_64.rpm
Globals are bad and evil:
http://stackoverflow.com/a/19158418/145125
There is a lot of usage of globals throughout eDeploy's code that should be fixed.
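As an illustration of the kind of refactoring meant here (the names below are hypothetical, not actual eDeploy code), module-level state can be replaced by an explicit parameter so callers and tests control the dependency:

```python
# Before: hidden module-level state, hard to test and reason about.
CONFIG_DIR = '/etc/edeploy'

def load_spec_global(name):
    return CONFIG_DIR + '/' + name + '.specs'

# After: the dependency is explicit; tests can pass their own directory.
def load_spec(name, config_dir='/etc/edeploy'):
    return config_dir + '/' + name + '.specs'
```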
I was wondering if we could also revert commit ff9e3a4. I am not sure to what extent this could damage the system, but I believe that many packages rely on policy-rc.d in their pre/post-install scripts. As packages are still allowed to be installed on the system (by using the appropriate package manager), this could potentially break package installations.
The idea here is to facilitate as much as possible the installation/debug process for the engineers in the field.
When trying to install java-1.6.0-openjdk, it fails with:
while the package has actually been installed.
The physical drive (pd) report looks like:
physicaldrive 1I:1:1 (port 1I:box 1:bay 1, SAS, 300 GB, OK)
physicaldrive 1I:1:2 (port 1I:box 1:bay 2, SAS, 300 GB, OK)
physicaldrive 1I:1:3 (port 1I:box 1:bay 3, SAS, 300 GB, OK, spare)
The resulting hw.py looks like:
('disk', '1I:1:3', 'slot', '0'),
('disk', '1I:1:3', 'size', 'OK'),
('disk', '1I:1:3', 'status', 'OK'),
The size is taken from the health field, not from the size one, because the 'spare' flag adds an extra field. Note that the type field is shifted too.
We have to improve the parsing, or rely only on the pd_all information.
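A more robust approach is to parse the fields by name with a regex that treats the trailing 'spare' flag as optional, instead of splitting positionally. A sketch (the exact hpacucli output format may vary by firmware version):

```python
import re

# Named groups make the optional trailing ", spare" harmless: size and
# status always land in the right fields regardless of the flag.
PD_RE = re.compile(
    r'physicaldrive\s+(?P<id>\S+)\s+\(port\s+(?P<port>\S+):'
    r'box\s+(?P<box>\d+):bay\s+(?P<bay>\d+),\s+(?P<type>\w+),\s+'
    r'(?P<size>[\d.]+\s+\w+),\s+(?P<status>\w+)(?:,\s+(?P<spare>spare))?\)')

def parse_pd(line):
    m = PD_RE.search(line)
    return m.groupdict() if m else None
```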
When deploying on a customer site, we had a VLAN-tagged network. eDeploy is not able to do an installation in such a network context.
We had to set up a separate network to make it work.
I suggest adding VLAN support to the IP= syntax, like:
IP=eth0:dhcp#7
The #7 could mean using vlan7 on eth0.
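Parsing such an extended syntax would be straightforward; a sketch of what the init script could do with it (the #vlan suffix is the proposal above, not an existing eDeploy feature):

```python
def parse_ip_option(value):
    # Parse the proposed "IP=<iface>:<mode>[#<vlan>]" syntax,
    # e.g. "eth0:dhcp#7" -> ('eth0', 'dhcp', 7).
    iface_mode, _, vlan = value.partition('#')
    iface, _, mode = iface_mode.partition(':')
    return iface, mode, int(vlan) if vlan else None
```

The init script could then bring up a vlan7 interface on top of eth0 before running the DHCP client.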
echo "deb http://hwraid.le-vert.net/debian wheezy main" > $target/etc/apt/sources.list.d/hwraid.list
You should use the $dist variable instead of hardcoding wheezy.
I get this problem on Debian. I suppose we will need to call dkms manually later.
Setting up dkms (2.2.0.3-1.2) ...
Setting up openvswitch-datapath-dkms (1.4.2+git20120612-9.1~deb7u1) ...
Creating symlink /var/lib/dkms/openvswitch/1.4.2+git20120612/source ->
/usr/src/openvswitch-1.4.2+git20120612
DKMS: add completed.
Error! Your kernel headers for kernel 3.11-2-amd64 cannot be found.
Please install the linux-headers-3.11-2-amd64 package,
or use the --kernelsourcedir option to tell DKMS where it's located
Since 3b65639, apt-get and yum are renamed to apt-get.moved and yum.moved at the end of the builds.
When a role depends on another one, say base, the common_setup() function should be called to restore these links.
Sadly, common_setup() is not called in most of the roles. This breaks the build.
It seems that code, configuration, tests, and static files all live in the root directory.
I suggest doing some shuffling around to make the layout tidier and more standard compared to other OSS projects:
or some other names, but at least some tidying up.
The same goes for a tox.ini launching the tests with their deps, instead of having to do it by hand, and a proper setup.py.
We had a case with:
How can we manage such a non-contiguous and reversed range in the CMDB using the generate() syntax?
We did it by hand and it was very painful.
Looking at this commit:
3b65639
I believe moving these binaries causes multiple issues:
I would suggest that we put in place a debug build in which these commands are accessible, and a production build in which they are not (or are hardly) accessible.
Comments?
When installing a server that has some instability issues, you might get some very long kernel traces that are difficult to catch.
When booting the AHC or PXE role on the server, it would be very useful to save the dmesg at an early stage onto a USB key to get a copy of what's happening.
AHC has /ahcexport, which could be used for this purpose; PXE should do it the same way.
When you have setups with various disks, it could be useful to get the by-id/ path in the logical disk's properties.
Instead of providing a list of IP addresses, most larger providers simply have subnet blocks attached to networks. For example:
vlan224: 192.168.24.0/24
vlan228: 192.168.28.0/24
vlan316: 172.16.16.0/24
Using these subnet definitions instead of a list of individual hosts can make larger deployments much easier. Accounting should still be done on a per-IP basis, but those entries can be written on demand to an accounting file, and absence of an entry would mean the IP address is available.
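A sketch of on-demand allocation from a subnet with per-IP accounting, using Python's ipaddress module (the function name and the in-memory set standing in for the accounting file are illustrative):

```python
import ipaddress

def next_free_ip(subnet, allocated):
    # Return the first host address of `subnet` not yet in `allocated`
    # (a set of strings playing the role of the accounting file).
    for ip in ipaddress.ip_network(subnet).hosts():
        if str(ip) not in allocated:
            allocated.add(str(ip))  # record the allocation
            return str(ip)
    raise RuntimeError('subnet %s exhausted' % subnet)
```

Absence from the accounting set means the address is free, exactly as proposed above.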
Traceback (most recent call last):
  File "/home/erwan/Devel/edeploy/server/try_match", line 43, in <module>
    if matcher.match_all(hw_items, specs, var, var2, debug=True):
  File "/home/erwan/Devel/edeploy/server/matcher.py", line 162, in match_all
    line = match_spec(spec, lines, arr)
  File "/home/erwan/Devel/edeploy/server/matcher.py", line 101, in match_spec
    if spec[idx][0] == '$':
TypeError: 'int' object has no attribute '__getitem__'
That's pretty strange, as we have two very similar hosts where one matches and the other doesn't.
It seems this line triggers the bug, but I don't really get why:
('disk', 'logical', 'count', 3),
http://pubz.free.fr/ProLiantDL360pGen8666532B21-HP-CZJ24900P9.hw
http://pubz.free.fr/ProLiantDL360pGen8666532B21-HP-CZJ24900P8.hw
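The traceback is consistent with the count being written as an integer: match_spec does spec[idx][0] == '$', and indexing an int raises exactly this TypeError under Python 2. A minimal repro (the assumption being that quoting the value as '3' would avoid the crash):

```python
spec_item = ('disk', 'logical', 'count', 3)  # int value, as in the failing spec

raised = False
try:
    spec_item[3][0] == '$'   # what matcher.py's match_spec effectively does
except TypeError:
    raised = True  # Python 2: 'int' object has no attribute '__getitem__'

# With the value quoted as a string, the same check works fine:
quoted = ('disk', 'logical', 'count', '3')
assert quoted[3][0] != '$'
```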
Currently, on the install-server-RH7.0-I.1.2.0.img, the grub configuration is as follows:
load_video
set gfxpayload=keep
insmod gzio
insmod ext2
if [ x$feature_platform_search_hint = xy ]; then
search --no-floppy --fs-uuid --set=root 1fe9121d-6a63-437c-9eb7-d609cbd444f1 1fe9121d-6a63-437c-9eb7-d609cbd444f1
search --no-floppy --fs-uuid --set=root 1fe9121d-6a63-437c-9eb7-d609cbd444f1 1fe9121d-6a63-437c-9eb7-d609cbd444f1
else
search --no-floppy --fs-uuid --set=root 1fe9121d-6a63-437c-9eb7-d609cbd444f1
search --no-floppy --fs-uuid --set=root 1fe9121d-6a63-437c-9eb7-d609cbd444f1
fi
linux16 /boot/vmlinuz-3.10.0-123.el7.x86_64 root=UUID=1fe9121d-6a63-437c-9eb7-d609cbd444f1 ro console=ttyS0
linux16 /boot/vmlinuz-3.10.0-123.el7.x86_64 root=UUID=1fe9121d-6a63-437c-9eb7-d609cbd444f1 ro console=ttyS0
initrd16 /boot/initramfs-3.10.0-123.el7.x86_64.img
Note how three of these lines are duplicated; this prevents the OS from booting.
Other packages are installed with apt-get install -y --force-yes, but python-pip seems to be installed by Ansible without -y:
+ install_edeploy
+ type -p ansible-playbook
/usr/local/bin/ansible-playbook
+ do_chroot /var/lib/jenkins/jobs/SoftwareFactory-functional-tests/workspace/roles/install/D7-H.1.0.0/install-server apt-get install python-pip
+ local chdir=/var/lib/jenkins/jobs/SoftwareFactory-functional-tests/workspace/roles/install/D7-H.1.0.0/install-server
+ shift
+ PATH=/bin/:/sbin:/sbin:/bin::/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
+ LANG=C
+ LC_ALL=C
+ LC_CTYPE=C
+ LANGUAGE=C
+ chroot /var/lib/jenkins/jobs/SoftwareFactory-functional-tests/workspace/roles/install/D7-H.1.0.0/install-server apt-get install python-pip
Reading package lists...
Building dependency tree...
Reading state information...
The following extra packages will be installed:
python-pkg-resources python-setuptools python2.6 python2.6-minimal
Suggested packages:
python-distribute python-distribute-doc python2.6-doc binutils
binfmt-support
Recommended packages:
python-dev-all build-essential
The following NEW packages will be installed:
python-pip python-pkg-resources python-setuptools python2.6
python2.6-minimal
0 upgraded, 5 newly installed, 0 to remove and 0 not upgraded.
Need to get 4681 kB of archives.
After this operation, 15.2 MB of additional disk space will be used.
Do you want to continue [Y/n]? Abort.
+ cleanup
+ ret=1
Hi,
The deployment cannot succeed if we have multiple kernels installed.
From what Erwan Velu observed, it is due to the init file (/srv/edeploy/build/init), lines 290-300:
case "$ONSUCCESS" in
"kexec")
    log "Booting with kexec as required by ONSUCCESS"
    if type -p kexec; then
        log_n "Trying kexec..."
        cp $d/boot/vmlinuz* /tmp/vmlinuz || give_up "Unable to copy kernel"
        if ls $d/boot/initrd.img*; then
            cp $d/boot/initrd.img* /tmp/initrd.img || give_up "Unable to copy initrd"
        else
            cp $d/boot/initramfs* /tmp/initrd.img || give_up "Unable to copy initrd"
        fi
So if we have a directory with multiple kernels inside, the wildcard copy won't work.
A way to solve it would be to specify the kernel/initrd to boot on (when there are multiple kernels), or to boot on the only, default one otherwise.
Cheers,
With @sbadia, we are facing various issues due to badly configured internal clocks. I think it would be very interesting to have chrony or ntpd installed in the base install and started during the boot process.
Personally I think chrony is better than ntpd, because it supports more exotic configurations (Active Directory, network lag, etc.) and it is more robust.
Python tests are inside the main src directory, which is not really standard; at the least they should sit in their own directory in src/tests/.
We should also use a setup.py, with a tox.ini for virtualenv.
When booting from USB sticks, it's usually done at USB1 speed thanks to the BIOS devs (aka the crack smokers).
The syslinux banners appear pretty quickly, but the kernel/initrd doesn't, thanks to the quiet option of the Linux kernel. The loading is pretty long, and if your USB key doesn't provide a blinking LED, you could consider the server frozen.
So let's just put a message after the syslinux banner to inform users that the system is currently being loaded from the boot device.
On a typical D7 usage, we were missing some very low-level tools:
ethtool
hdparm
host
nslookup
netcat
iptables
The raid/disk/ldrive detection is a sequential loop based on a simple for loop:
https://github.com/enovance/edeploy/blob/master/src/detect.py#L75-L79
https://github.com/enovance/edeploy/blob/master/src/detect.py#L99-L101
In most cases, it will work. But if we have disks at the beginning and at the end of the raid card with empty slots in between, it fails.
10 disks : 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | x | x => OK
10 disks : 0 | 1 | 2 | 3 | 4 | x | x | 5 | 6 | 7 | 8 | 9 => KO
It's the same if we have multiple raid cards, like HP Smart Array:
# hpacucli ctrl all show
Smart Array P812 in Slot 1 (sn: xxxxxxxxxxxxxxxxxx)
Smart Array P812 in Slot 4 (sn: xxxxxxxxxxxxxxxxxx)
We have slots number 1 and 4, but the script wants 0 and 1.
And it's the same problem for logical drives.
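Instead of iterating over range(count), the detection could enumerate the slot/ID values the controller actually reports. A sketch against the hpacucli output shown above (the regex and function name are illustrative, not eDeploy's actual code):

```python
import re

def present_slots(ctrl_show_output):
    # Collect the controller slot numbers actually reported, rather than
    # assuming they are contiguous and start at 0.
    return [int(m.group(1))
            for m in re.finditer(r'in Slot (\d+)', ctrl_show_output)]

output = """Smart Array P812 in Slot 1 (sn: xxx)
Smart Array P812 in Slot 4 (sn: yyy)"""
# present_slots(output) -> [1, 4]
```

The same idea applies to physical disk IDs and logical drives: parse what the tool reports, then loop over that list.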
When running AHC, the benchmark runs and uploads to SERV at the end of the run.
But if the upload service isn't available, or if the IP is incorrect, the benchmark takes minutes to run and then fails at uploading... it's pretty sad to wait that long just to learn the server isn't available.
We should check that the service is ready before starting the benchmark.
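A cheap pre-flight check before launching the benchmark could be a simple TCP connect to the upload service (the function name and default timeout are illustrative):

```python
import socket

def upload_server_ready(host, port, timeout=5):
    # Return True only if the upload service accepts TCP connections,
    # so we can fail fast instead of failing after minutes of benchmarking.
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```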
Hi,
One quick question: why use hdparm -z to reload the partition table?
There is blockdev, which is installed with util-linux, and partprobe, installed with parted.
Why use yet another external command?
One more thing: I believe that sdparm gives better support for SCSI/SAS disks.
Thanks.
When an installation fails, we have to re-increment the role's counter on the server side, usually on the CGI server.
The profile is detected by extracting some information inside the configure script. During setup we had several cases where the failure wasn't properly reported.
It would be necessary to report this information as soon as possible inside the CGI stream, and also to have better parsing of special characters inside the profile name.
Server A has 1 disk: 500GB.
Server B has 2 disks of the same type as server A's.
Servers A & B are really the same except for the disks. If you write a rule that matches a single disk, both servers will match.
We should add the number of physical & logical disks to the profile to avoid such cases.
Reported by @fcharlier in a real case.
Hi,
when trying to build a new role containing the snmpd package, I got the following error:
Setting up snmpd (5.4.3~dfsg-2+squeeze1) ...
+ [ xconfigure = xconfigure ]
+ getent group snmp
+ [ ! ]
+ deluser --quiet --system snmp
+ adduser --quiet --system --group --no-create-home --home /var/lib/snmp snmp
ORIG ['/usr/sbin/groupadd', '-g', '105', 'snmp']
mngids.py: found --gid at 1 for val[snmp]=140
['/usr/sbin/groupadd.real', '-g', '140', 'snmp']
ORIG ['/usr/sbin/useradd', '-d', '/var/lib/snmp', '-g', 'snmp', '-s', '/bin/false', '-u', '103', 'snmp']
mngids.py: found --gid at 3 for val[snmp]=140
mngids.py: found --uid at 7 for val[snmp]=127
['/usr/sbin/useradd.real', '-d', '/var/lib/snmp', '-g', '140', '-s', '/bin/false', '-u', '127', 'snmp']
+ chown -R snmp:snmp /var/lib/snmp
+ . /usr/share/debconf/confmodule
+ [ ! ]
+ PERL_DL_NONLAZY=1
+ export PERL_DL_NONLAZY
+ [ ]
+ exec /usr/share/debconf/frontend /var/lib/dpkg/info/snmpd.postinst configure
+ [ xconfigure = xconfigure ]
+ getent group snmp
+ [ ! ]
+ deluser --quiet --system snmp
+ adduser --quiet --system --group --no-create-home --home /var/lib/snmp snmp
+ chown -R snmp:snmp /var/lib/snmp
+ . /usr/share/debconf/confmodule
+ [ ! 1 ]
+ [ -z ]
+ exec
+ [ ]
+ exec
+ DEBCONF_REDIR=1
+ export DEBCONF_REDIR
+ db_version 2.0
+ _db_cmd VERSION 2.0
+ IFS= printf %s\n VERSION 2.0
+ IFS=
read -r _db_internal_line
+ RET=20 Unsupported command "orig" (full line was "ORIG ['/usr/sbin/groupadd', '-g', '105', 'snmp']") received from confmodule.
+ return 20
dpkg: error processing snmpd (--configure):
subprocess installed post-installation script returned error exit status 128
configured to not write apport reports
Errors were encountered while processing:
snmpd
E: Sub-process /usr/bin/dpkg returned an error code (1)
It seems to be linked to mngids.py:
def main():
    uids = {}
    gids = {}

    print('ORIG', sys.argv)

    IDS = '/root/ids.tables'
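The failure looks like the debug print polluting the debconf protocol: the wrapped groupadd runs under debconf's confmodule, which reads the child's stdout, and the ORIG line is interpreted as a bogus command ("Unsupported command "orig""). A hedged sketch of a fix: send debug traces to stderr instead of stdout:

```python
from __future__ import print_function
import sys

def debug(*args):
    # Debug traces must never go to stdout: when the wrapped command runs
    # under debconf, stdout is the confmodule protocol channel and any
    # stray line becomes an "Unsupported command" error.
    print(*args, file=sys.stderr)

debug('ORIG', ['/usr/sbin/groupadd', '-g', '105', 'snmp'])
```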
Currently, cron and logrotate don't seem to be installed/enabled in the openstack-full roles; we should ensure they are installed.
DVER=U12
DIST=precise
PVER=H
Note, selecting 'neutron-vpn-agent' instead of 'neutron-plugin-vpn-agent'
E: Unable to locate package mongodb-10gen
The build directory comes with the roles. Those roles are, IMO, configuration files. They will have their own release (version/tag/etc.). I think we should move them outside the eDeploy repository.
When you are setting up a cloud, you have to write CMDBs.
The easiest way is to use the generate() syntax. Once you run it for the first time, the CMDB is expanded. If you made a mistake, the generate() version of the file is lost and you have to write it again.
It would be nice, when we have a generate() version of the file, to make a backup of it before expanding it. This way, you keep both the original version and the expanded one.
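A sketch of the backup step (the function name and the .orig suffix are hypothetical; the real expansion lives in eDeploy's CMDB handling):

```python
import shutil

def expand_cmdb(path, expand):
    # Keep the generate() source next to the expanded result, so a
    # mistake in the ranges doesn't force rewriting the file by hand.
    backup = path + '.orig'
    shutil.copy2(path, backup)
    with open(path) as f:
        source = f.read()
    with open(path, 'w') as f:
        f.write(expand(source))
    return backup
```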
AHC considers that we are running on a USB stick if /ahcexport is found.
If for any reason this check fails (and I have had this happen), the bootable device is considered a disk that can be benchmarked.
In destructive mode, the USB key is smashed....
img.install should pass a parameter to inform us that we are booting from a USB device, and if no /ahcexport is found, we should stop, as we are not able to determine what the boot device was....
Case:
But eDeploy is case-sensitive and compares @mac with a MAC written in capitals, so it doesn't match.
Would it be possible to implement case-insensitive matching in eDeploy (just for the MAC?)?
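Normalizing both sides before comparison would be enough for the MAC case; a sketch (the helper name is illustrative):

```python
def normalize_mac(mac):
    # Compare MACs case-insensitively and regardless of separator style.
    return mac.replace('-', ':').lower()
```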
The RHEL and CentOS base roles provide vi, whereas there is no editor in the Debian base configuration. This is a real pain when we have to do some trivial admin task.
Would it be possible to add a light editor like nano or vim to the base role?
ctrl slot=X delete forced OR self._sendline('ctrl %s delete forced' % selector)
The actual error gets cut off in the logs as it is too long, but hpacucli refuses to destroy the array if there are mounted volumes. If this error were caught properly, you could unmount both directories and then continue to destroy the array.
Useful when you are running ./configure over and over to find out where your errors are.
When building the openstack-full role on Ubuntu, the upstart jobs are not disabled.
Running megacli.py on a machine with 2 enclosures detected, it fails querying disks for the first enclosure and queries the second enclosure instead.
See logs here: https://gist.github.com/fcharlier/9662117
"shell: service apache2 restart" is replaced with "shell:#" for ansible/pxemngr-install.yml in deploy.install, so the test fails:
NOTIFIED: [enable pxemngr site] ***********************************************
failed: [chroot] => {"failed": true, "rc": 256}
msg: no command given
1M : SUMMARY : 2 consistent hosts with 12753.50 MB/s as average value and 460.50 standard deviation
1M : SUMMARY : 2 unstable hosts with 12753.50 MB/s as average value and 460.50 standard deviation
Hi,
Because the base role excludes the following by default (from base.exclude):
[...]
/etc/ssh/ssh_host_dsa_key
/etc/ssh/ssh_host_dsa_key.pub
/etc/ssh/ssh_host_ecdsa_key
/etc/ssh/ssh_host_ecdsa_key.pub
/etc/ssh/ssh_host_rsa_key
/etc/ssh/ssh_host_rsa_key.pub
[...]
we can't connect to a freshly deployed machine (from auth.log):
Could not load host key: /etc/ssh/ssh_host_key
Could not load host key: /etc/ssh/ssh_host_dsa_key
Disabling protocol version 1. Could not load host key
Disabling protocol version 2. Could not load host key
As soon as the keys are regenerated:
# ssh-keygen -t dsa -f /etc/ssh/ssh_host_dsa_key
# ssh-keygen -t rsa -f /etc/ssh/ssh_host_rsa_key
the server becomes accessible via SSH.
I don't know if it's meant to be that way, or if the keys should be regenerated at every deployment to avoid having identical DSA/RSA keys on multiple servers.
Cheers,
It doesn’t because we have:
root@osc-ctrl1-cdc:~# ls -alh /var/log/syslog*
-rw-r----- 1 root adm 0 Feb 7 07:25 /var/log/syslog
-rw-r----- 1 root adm 537M Feb 8 19:56 /var/log/syslog.1
Since the size is 0, it doesn't rotate; however, /var/log/syslog.1 keeps growing.
Permissions look good though...
I fixed that by restarting the rsyslog process.
In this situation, postrotate is not called since no logs were rotated.
HOWEVER, we deny rc invocation with the script /usr/sbin/policy-rc.d, which uses exit 101; this denies service execution.
This is the problem: when we manually restart rsyslog, it creates the new /var/log/syslog file, but when it's time to rotate again we can't, because the rc invocation is denied.
Today, hpacucli offers a high-level API to create/delete RAID arrays.
megacli lacks one, and requires people to call megacli directly inside their configure script.
It would be very useful to have a common API able to manipulate RAID arrays the same way from configure.
That would reduce the amount of knowledge needed to manage RAID arrays, and also avoid mistakes.
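The common API could be a thin abstract layer with per-tool backends; a sketch of the interface only (class and method names are hypothetical, and the real hpacucli module may differ):

```python
class RaidController(object):
    """Tool-agnostic interface; hpacucli/megacli backends would implement it."""
    def list_arrays(self):
        raise NotImplementedError
    def create_array(self, raid_level, disks):
        raise NotImplementedError
    def delete_array(self, array_id):
        raise NotImplementedError

class FakeController(RaidController):
    # In-memory backend: useful for tests, and a template for a megacli one.
    def __init__(self):
        self.arrays = {}
        self._next = 0
    def list_arrays(self):
        return sorted(self.arrays)
    def create_array(self, raid_level, disks):
        self._next += 1
        self.arrays[self._next] = (raid_level, list(disks))
        return self._next
    def delete_array(self, array_id):
        del self.arrays[array_id]
```

A configure script would then depend only on RaidController, whatever controller hardware is present.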
Hi,
I tried to deploy Dell servers, but I discovered that megacli is not supported by eDeploy. I tried to find out why. First, there is no "import megacli" in my configure file; I don't know how this file is generated. Then I took a look at megacli.py, and it seems that there is not a lot of code. Anyway, could you add megacli support? I can help by giving you access to Dell servers if you like :)
When booting a PXE installation or a benchmark (which is almost the same code), having a single console that is stuck doing its stuff is sometimes a pain. When things go wrong or slow, you would love to have another console to check how it's going and run some commands.
Adding a 2nd console on tty1 would be very useful for debugging.
During a deployment, some information stored in default_grub was not present at boot time. Sounds like something is wrong here.
This is part of init script.
Hi,
@goldyfruit has worked on the Dell integration for eDeploy. It would be great if we could work on merging the code into the master branch.
https://github.com/enovance/edeploy/tree/dell
Tell me if I can help in any way.