
seagate / cortx-prvsnr


The CORTX Provisioner offers a framework which accepts configurations (cluster.yaml and config.yaml) in the form of a ConfigMap, translates them into the internal configuration (CORTX Conf Store), and then orchestrates across the components' mini-provisioners to allow them to configure their services. In a Kubernetes environment, the CORTX Provisioner framework runs on all CORTX pods (in a separate, one-time init container).

Home Page: https://github.com/Seagate/cortx

License: GNU Affero General Public License v3.0

Languages: Python 93.00%, Shell 7.00%
Topics: provisioning, provisioner, salt-formula, saltstack-formula


cortx-prvsnr's Issues

deploy-eos fails to apply `components.misc.build_ssl_cert_rpms` on eosnode-2

Host key verification fails when eosnode-2 tries to reach eosnode-1, since the FQDN of eosnode-1 from cluster.sls might not match the predefined entries from http://gitlab.mero.colo.seagate.com/eos/provisioner/ees-prvsnr/issues/15 (a workaround sketch follows the log below).

eosnode-2:
----------
          ID: Copy certs from primary
    Function: cmd.run
        Name: scp 10.237.128.253:/opt/seagate/stx-s3-*.rpm /opt/seagate
      Result: False
     Comment: Command "scp 10.237.128.253:/opt/seagate/stx-s3-*.rpm /opt/seagate" run
     Started: 09:01:18.209977
    Duration: 1292.265 ms
     Changes:
              ----------
              pid:
                  8006
              retcode:
                  1
              stderr:
                  Host key verification failed.
              stdout:

Summary for eosnode-2
------------
Succeeded: 0 (changed=1)
Failed:    1
------------
Total states run:     1
Total run time:   1.292 s
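
A minimal workaround sketch, assuming root SSH is used between the nodes: pre-populate known_hosts on eosnode-2 so that the scp from the primary no longer trips host key verification (the IP is the one from the log above):

ssh-keyscan -H 10.237.128.253 >> /root/.ssh/known_hosts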

`components.misc.build_ssl_cert_rpms` states for eosnode-1 are non-idempotent

Repeatedly applying the components.misc.build_ssl_cert_rpms states shows:

Summary for eosnode-1
-------------
Succeeded: 15 (changed=14)
Failed:     0
-------------
Total states run:     15
Total run time:   80.118 s
Summary for eosnode-1
-------------
Succeeded: 15 (changed=13)
Failed:     0
-------------
Total states run:     15
Total run time:   14.167 s

Jinja conditionals make the application logic of some salt states unclear

Currently, conditional application of salt states is controlled by pillar values and Jinja.
This leads to very different logic depending on the pillar values, which can be understood only through deep code exploration.

E.g. halon facts are generated either during halon's config phase or during the global post-setup phase, depending on the cluster mode (single or ees):
http://gitlab.mero.colo.seagate.com/eos/provisioner/ees-prvsnr/blob/master/srv/components/halon/config/init.sls
http://gitlab.mero.colo.seagate.com/eos/provisioner/ees-prvsnr/blob/master/srv/components/post_setup/init.sls

I believe the better option would be to define a relation between target minion classes (e.g. primary, secondary, primary-singlenode) and the state application order, declared at the top level of the states.
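
A rough sketch of that idea in a top file, assuming each minion is assigned a class via a grain (the roles grain and the exact state names are hypothetical):

# srv/top.sls (sketch): state application order declared per minion class
base:
  'G@roles:primary':
    - match: compound
    - components.halon.config
    - components.post_setup
  'G@roles:secondary':
    - match: compound
    - components.halon.config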

rpm doesn't include/apply network configuration files

Currently, network configuration is not possible with just the provisioner rpm installed, since the network configuration scripts (http://gitlab.mero.colo.seagate.com/eos/provisioner/ees-prvsnr/tree/master/files/etc/sysconfig) are not part of the rpm (http://gitlab.mero.colo.seagate.com/eos/provisioner/ees-prvsnr/blob/master/rpms/eos-prvsnr.spec).
Options:

  1. add to the provisioner rpm, sub-options:
    1. just place them in the installation directory
    2. apply them during installation
  2. add to the provisioner cli rpm (but that would contradict the fact that we include the salt configs in the rpm and apply them during installation)
  3. do the network configuration as part of applying the salt system component (this seems the best option; see the sketch below)
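
A minimal sketch of option 3, assuming the ifcfg files from files/etc/sysconfig are shipped inside the salt file roots (the source path is hypothetical):

# components/system/network/init.sls (sketch)
network_config_files:
  file.recurse:
    - name: /etc/sysconfig
    - source: salt://components/system/files/etc/sysconfig
    - include_empty: True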

Missing docs for CLI scripts

!78 introduces significant changes to the CLI scripts' internals along with some API refinements. Thus it is required to:

  • provide markdown for cli folder with overview, usage details and examples
  • add comments to shell scripts in cli/src

Missing logging in cli scripts

Missed in #78.
The current verbosity of the cli scripts (especially setup-provisioner) is not enough to understand what is happening and why something has gone wrong. This needs to be improved (it relates to the verbosity option); a logging sketch follows below.
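
A small sketch of what shared, leveled logging in the cli scripts could look like (bash; all names here are hypothetical):

# verbosity would be set by the -v/--verbose option parsing
verbosity=0

log() {
    local level=$1; shift
    if [ "$level" -le "$verbosity" ]; then
        echo "[$(date '+%F %T')] $*" >&2
    fi
}

# usage: log 0 "INFO: Applying 'components.system'"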

Problem: "Check system hostname" step fails on fresh hardware system

This happens when following the QuickStart Guide just after an OS reimage, so the fix has to be applied either to the provisioner or to the LCO lab infrastructure.

More details:

520406@smc19-m10:~> time sudo sh /opt/seagate/eos-prvsnr/cli/deploy-eos -S
INFO: Applying 'components.system'
eosnode-1:
<...>
----------
          ID: Check system hostname
    Function: cmd.run
        Name: test $(salt --no-color eosnode-1 grains.get host|tail -1|tr -d "[:blank:]") == $(hostname)
      Result: False
     Comment: Command "test $(salt --no-color eosnode-1 grains.get host|tail -1|tr -d "[:blank:]") == $(hostname)" run
     Started: 14:46:24.997655
    Duration: 738.599 ms
     Changes:
              ----------
              pid:
                  113255
              retcode:
                  1
              stderr:
              stdout:
----------

At the same time:

> hostname
smc19-m10.mero.colo.seagate.com
> sudo salt --no-color eosnode-1 grains.get host
eosnode-1:
    smc19-m10
520406@smc19-m10:~> cat /etc/hostname
smc19-m10.mero.colo.seagate.com
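
The outputs above show the mismatch: the host grain returns the short name (smc19-m10) while hostname returns the FQDN. A possible fix, sketched under the assumption that comparing FQDNs is the intent, is to test against the fqdn grain instead:

test $(salt --no-color eosnode-1 grains.get fqdn|tail -1|tr -d "[:blank:]") == $(hostname)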

Cluster configuration:

> cat /opt/seagate/eos-prvsnr/pillar/components/cluster.sls
cluster:
  type: single        # single/ees/ecs
  node_list:
    - eosnode-1
  eosnode-1:
    hostname: smc19-m10
    is_primary: true
    network:
      mgmt_if: eno1                   # Management network interfaces for bonding
      data_if: bond0                  # Management network interfaces for bonding
      gateway_ip:                     # No Implementation
    storage:
      metadata_device:                # Device for /var/mero and possibly SWAP
        - /dev/mapper/mpathb
      data_devices:                   # Data device/LUN from storage enclosure
        - /dev/mapper/mpathc
        - /dev/mapper/mpathd
        - /dev/mapper/mpathe
        - /dev/mapper/mpathf
        - /dev/mapper/mpathg
        - /dev/mapper/mpathn
        - /dev/mapper/mpatho
  storage_enclosure:
    id: storage_node_1            # equivalent to fqdn for server node
    type: 5U84                    # Type of enclosure. E.g. 5U84/PODS
    controller:
      type: gallium               # Type of controller on storage node. E.g. gallium/indium/sati
      primary_mc:
        ip: 127.0.0.1
        port: 8090
      secondary_mc:
        ip: 127.0.0.1
        port: 8090
      user: user
      password: 'passwd'

sudo salt '*' grains.ls

Manage /etc/hosts to cater to Halon and S3Clients

Should address:
Avoid setting the hostname provided in cluster.sls; it causes DNS issues in Halon while generating facts. Also, if an IP is specified instead of a hostname in cluster.sls, it breaks many things.
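
A sketch of managing the entries directly with salt's host.present state (the IP and names are illustrative, borrowed from the data-network examples elsewhere on this page):

# sketch: pin cluster names in /etc/hosts instead of relying on cluster.sls hostnames
eosnode-1_hosts_entry:
  host.present:
    - ip: 172.19.10.101
    - names:
      - eosnode-1
      - eosnode-1-data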

Remove openhpi from components.sspl.install

Based on DevOps Room chat:

> openhpid[1755]: openhpid: plugin.c:589: A handler #1 on the libipmidirect plugin could not be opened. The .so lib is present in /usr/lib64/openhpi.
HPIMonitor is disabled in the default sspl.conf as well, so we don't ideally need the openhpid installation for EES? Did you have any old installation of sspl for which openhpid was installed? You may ignore the openhpid error log, or rather uninstall openhpid to stop getting that error.

openhpid is no longer required and needs to be removed.

While components.sspl.install is being revisited, it would be a good idea to reconsider the entire list of dependencies.

Problem: Step 1 from QuickStart Guide fails on fresh system due to invalid salt version

Reference: http://gitlab.mero.colo.seagate.com/eos/provisioner/ees-prvsnr/wikis/Setup-Guides/QuickStart-Guide

Install the Provisioner CLI rpm (eos-prvsnr-cli) from the eos release repo:

This command fails with the following message:

520406@smc19-m10:~> sudo yum install -y http://ci-storage.mero.colo.seagate.com/releases/eos/components/dev/centos-7.7.1908/provisioner/106/eos-prvsnr-1.0.0-106_git8f24627_el7.x86_64.rpm http://ci-storage.mero.colo.seagate.com/releases/eos/components/dev/centos-7.7.1908/provisioner/106/eos-prvsnr-cli-1.0.0-106_git8f24627_el7.x86_64.rpm
Loaded plugins: enabled_repos_upload, fastestmirror, package_upload, product-id, search-disabled-repos, subscription-manager
eos-prvsnr-1.0.0-106_git8f24627_el7.x86_64.rpm                                                                                                                | 194 kB  00:00:00
Examining /var/tmp/yum-root-KerS3K/eos-prvsnr-1.0.0-106_git8f24627_el7.x86_64.rpm: eos-prvsnr-1.0.0-106_git8f24627_el7.x86_64
Marking /var/tmp/yum-root-KerS3K/eos-prvsnr-1.0.0-106_git8f24627_el7.x86_64.rpm to be installed
eos-prvsnr-cli-1.0.0-106_git8f24627_el7.x86_64.rpm                                                                                                            |  37 kB  00:00:00
Examining /var/tmp/yum-root-KerS3K/eos-prvsnr-cli-1.0.0-106_git8f24627_el7.x86_64.rpm: eos-prvsnr-cli-1.0.0-106_git8f24627_el7.x86_64
Marking /var/tmp/yum-root-KerS3K/eos-prvsnr-cli-1.0.0-106_git8f24627_el7.x86_64.rpm to be installed
Resolving Dependencies
--> Running transaction check
---> Package eos-prvsnr.x86_64 0:1.0.0-106_git8f24627_el7 will be installed
--> Processing Dependency: salt-master = 2019.2.0 for package: eos-prvsnr-1.0.0-106_git8f24627_el7.x86_64
Loading mirror speeds from cached hostfile
EOS_CentOS-7_CentOS-7-Extras                                                                                                                                  | 2.1 kB  00:00:00
EOS_CentOS-7_CentOS-7-OS                                                                                                                                      | 2.1 kB  00:00:00
EOS_CentOS-7_CentOS-7-Updates                                                                                                                                 | 2.1 kB  00:00:00
EOS_CentOS-7_EPEL-7                                                                                                                                           | 2.1 kB  00:00:00
EOS_CentOS-7_Katello-Client                                                                                                                                   | 2.1 kB  00:00:00
EOS_CentOS-7_mlnx_ofed-4_7-3_2_9_0                                                                                                                            | 2.1 kB  00:00:00
--> Processing Dependency: salt-minion = 2019.2.0 for package: eos-prvsnr-1.0.0-106_git8f24627_el7.x86_64
--> Processing Dependency: python36 for package: eos-prvsnr-1.0.0-106_git8f24627_el7.x86_64
--> Processing Dependency: python36-PyYAML for package: eos-prvsnr-1.0.0-106_git8f24627_el7.x86_64
--> Processing Dependency: python36-pip for package: eos-prvsnr-1.0.0-106_git8f24627_el7.x86_64
---> Package eos-prvsnr-cli.x86_64 0:1.0.0-106_git8f24627_el7 will be installed
--> Processing Dependency: PyYAML for package: eos-prvsnr-cli-1.0.0-106_git8f24627_el7.x86_64
--> Running transaction check
---> Package PyYAML.x86_64 0:3.10-11.el7 will be installed
--> Processing Dependency: libyaml-0.so.2()(64bit) for package: PyYAML-3.10-11.el7.x86_64
---> Package eos-prvsnr.x86_64 0:1.0.0-106_git8f24627_el7 will be installed
--> Processing Dependency: salt-master = 2019.2.0 for package: eos-prvsnr-1.0.0-106_git8f24627_el7.x86_64
--> Processing Dependency: salt-minion = 2019.2.0 for package: eos-prvsnr-1.0.0-106_git8f24627_el7.x86_64
---> Package python3.x86_64 0:3.6.8-10.el7 will be installed
--> Processing Dependency: python3-libs(x86-64) = 3.6.8-10.el7 for package: python3-3.6.8-10.el7.x86_64
--> Processing Dependency: python3-setuptools for package: python3-3.6.8-10.el7.x86_64
--> Processing Dependency: libpython3.6m.so.1.0()(64bit) for package: python3-3.6.8-10.el7.x86_64
---> Package python3-pip.noarch 0:9.0.3-5.el7 will be installed
---> Package python36-PyYAML.x86_64 0:3.12-1.el7 will be installed
--> Running transaction check
---> Package eos-prvsnr.x86_64 0:1.0.0-106_git8f24627_el7 will be installed
--> Processing Dependency: salt-master = 2019.2.0 for package: eos-prvsnr-1.0.0-106_git8f24627_el7.x86_64
--> Processing Dependency: salt-minion = 2019.2.0 for package: eos-prvsnr-1.0.0-106_git8f24627_el7.x86_64
---> Package libyaml.x86_64 0:0.1.4-11.el7_0 will be installed
---> Package python3-libs.x86_64 0:3.6.8-10.el7 will be installed
---> Package python3-setuptools.noarch 0:39.2.0-10.el7 will be installed
--> Finished Dependency Resolution
Error: Package: eos-prvsnr-1.0.0-106_git8f24627_el7.x86_64 (/eos-prvsnr-1.0.0-106_git8f24627_el7.x86_64)
           Requires: salt-master = 2019.2.0
           Available: salt-master-2015.5.10-2.el7.noarch (EOS_CentOS-7_EPEL-7)
               salt-master = 2015.5.10-2.el7
Error: Package: eos-prvsnr-1.0.0-106_git8f24627_el7.x86_64 (/eos-prvsnr-1.0.0-106_git8f24627_el7.x86_64)
           Requires: salt-minion = 2019.2.0
           Available: salt-minion-2015.5.10-2.el7.noarch (EOS_CentOS-7_EPEL-7)
               salt-minion = 2015.5.10-2.el7
 You could try using --skip-broken to work around the problem
 You could try running: rpm -Va --nofiles --nodigest
Uploading Enabled Repositories Report
Loaded plugins: fastestmirror, product-id, subscription-manager
Loaded plugins: fastestmirror, product-id, subscription-manager
Loaded plugins: fastestmirror, product-id, subscription-manager
Loaded plugins: fastestmirror, product-id, subscription-manager
Loaded plugins: fastestmirror, product-id, subscription-manager
Loaded plugins: fastestmirror, product-id, subscription-manager

Manual workaround: copy files/etc/yum.repos.d/saltstack.repo from this repo to /etc/yum.repos.d/ on the target host, then run (not sure if required) yum clean all, then proceed with yum install -y ....
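
The same workaround as commands, run from a checkout of this repo (the rpm URLs are the ones from the failing yum install above):

cp files/etc/yum.repos.d/saltstack.repo /etc/yum.repos.d/
yum clean all   # possibly not required
yum install -y http://ci-storage.mero.colo.seagate.com/releases/eos/components/dev/centos-7.7.1908/provisioner/106/eos-prvsnr-1.0.0-106_git8f24627_el7.x86_64.rpm http://ci-storage.mero.colo.seagate.com/releases/eos/components/dev/centos-7.7.1908/provisioner/106/eos-prvsnr-cli-1.0.0-106_git8f24627_el7.x86_64.rpm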

Problem: firewalld is misconfigured

Consul agents cannot communicate because of that.

Workaround is to disable firewalld manually:

salt \* service.stop firewalld
salt \* service.disable firewalld

Solution: fix firewalld configuration.
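
A sketch of what the proper fix could look like: open Consul's documented default ports (8300 server RPC, 8301/8302 Serf LAN/WAN over tcp and udp, 8500 HTTP, 8600 DNS) instead of disabling firewalld. The port list should be checked against the actual deployment:

salt \* cmd.run "firewall-cmd --permanent --add-port=8300/tcp --add-port=8301/tcp --add-port=8301/udp --add-port=8302/tcp --add-port=8302/udp --add-port=8500/tcp --add-port=8600/tcp --add-port=8600/udp"
salt \* cmd.run "firewall-cmd --reload"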

Separate out installation of SSL certificates from openLDAP component

Currently, certificates are built as part of the s3server component and installed as part of the openLDAP component.
Since s3authserver and haproxy require the certificates to be installed, they depend on the openldap state being run.
s3authserver may have a dependency on ldap, but haproxy doesn't; it just needs the certificates installed.
It would be cleaner if the certificates were built and installed independently; dependencies on that state could then be declared by the states which require them, so that in the future, if any other component (like haproxy) needs ssl certificates, just running the install-certs state will be required (see the sketch below).
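
A sketch of the proposed structure, assuming the certificates are split out into a standalone state (the name components.misc.install_certs is hypothetical):

# haproxy init.sls (sketch): depend on the certs state directly, not on openldap
include:
  - components.misc.install_certs

haproxy_service:
  service.running:
    - name: haproxy
    - require:
      - sls: components.misc.install_certs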

Follow-up from "[EOS-1743] bootstrap-eos script"

The following discussions from #72 should be addressed:

  • @pritam.bhavsar started a discussion: (+1 comment)

    why exit when debug is true?
    can it be 'set -x' when debug is true?

  • @pritam.bhavsar started a discussion: (+1 comment)

    [SUGGESTION]:
    In the next sprint we can take this out and put it in the setup-provisioner script, and maybe rename the setup-provisioner script to something like setup-nodes, etc.

  • @pritam.bhavsar started a discussion: (+1 comment)

    [BUG]:
    The sequence of post_setup calls should be in reverse order; please check the wiki.
    Post_run should be run on the second node first and then on the primary node.

  • @pritam.bhavsar started a discussion: (+1 comment)

    [SUGGESTION]:
    May be in refine cli script task we can:

    1. Factor common tasks into a separate script/function and reuse them across cli scripts, e.g. parsing.
    2. Check the status of commands like hctl mero bootstrap/start/stop/status and report accordingly.
    3. Handle scenarios wherein the cluster goes into a failed/inhibited state, causing hctl commands to hang forever.
    4. Define user-friendly and uniform command arguments across cli scripts (maybe already in place).

Controller provisioning CLI: split commands into individual files.

The controller.sh script introduced as part of EOS-2436 needs to be split into separate scripts for individual commands, called from the main function. This will help when one of the stages/commands fails: we might not want to repeat the whole sequence after troubleshooting/fixing, and granularity of execution would help (see the sketch below).
Jira ticket EOS-2511 is opened for the same.
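
A sketch of the split, where controller.sh becomes a thin dispatcher over per-command scripts (all file names here are hypothetical):

# controller.sh (sketch)
script_dir=$(dirname "$0")
command=$1; shift

case "$command" in
    provision) "$script_dir/controller-provision.sh" "$@" ;;
    status)    "$script_dir/controller-status.sh" "$@" ;;
    *)         echo "Unknown command: $command" >&2; exit 2 ;;
esac

This way, a failed stage can be retried by re-running only its script.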

Problem: Provisioner mounts `/var/mero` directory

components.system.storage.config mounts /var/mero on every cluster node.

If /var/mero directory is mounted, Hare's build-ees-ha script (components.ha.ees_ha) will fail.

Environment: this issue is consistently reproduced on smc7-m11/smc8-m11 cluster; eos-prvsnr-cli-1.0.0-142_gitb4c8594_el7.x86_64, eos-prvsnr-1.0.0-142_gitb4c8594_el7.x86_64.

Jira: EOS-5513

Follow-up from "EOS-1742 Provisioning: FactorySetup: Provisioner CLI - configure-eos.sh script"

The following discussions from #69 should be addressed:

  • @andrey.kononykhin started a discussion: (+2 comments)

    Could you please use some ordering scheme for the options? In my experience, some widely used Linux tools opt for alphabetical order by long option. E.g.

       -c, --component ...
       -u, --file  ...
       -h, --help ...,
       -d, --show-file-format ...
    

    And please keep that order during argument parsing as well.
    Also, I think it would be better to use the same style for the help texts: I mean the first-letter case and the dot at the end.
    Thanks

  • @andrey.kononykhin started a discussion: (+2 comments)

    Do you really need a subshell here? It will make some things more complicated (e.g. the sudo prompt, if we expect that). As I see it, for now you just echo the output of the commands; I believe it would be available (visible) without the echo as well.

  • @andrey.kononykhin started a discussion: (+4 comments)

    Do you really need that additional update_file variable?

  • @andrey.kononykhin started a discussion: (+1 comment)

    I think it makes sense to provide the ability to run in dry mode here, e.g. use stdout as file_path to just show what is going to happen instead of doing a real file creation/update.

  • @andrey.kononykhin started a discussion: (+4 comments)

    I've noticed that there are multiple lines which use tabs, and this breaks indentation in some editors if they use a different tab width. Could you please replace all tabs with spaces? Thank you

  • @andrey.kononykhin started a discussion: (+2 comments)

    I think this is a good place to prepare the copy command, but not the best place to apply it. What if show_file is true; do we need to perform any copy then? Also, how would we deal with the case when file_path is defined and local mode is used?

Problem: cannot create CentOS 7.7 Vagrant box for VirtualBox provider

The packer build images/os/centos_77_1908_vbox.json command fails.
The error message (displayed in the VirtualBox window):

dracut-initqueue[739]: Warning: anaconda: failed to fetch kickstart from http://10.0.2.2:8695/kickstart_centos_77_1908.cfg

Environment

git branch: EOS-4482

$ git describe
ees1.0.0-PI.4-sprint12-34-g1a852f5

The value of the ks= argument has been modified to correspond to the actual name of the .cfg file:

$ git diff
diff --git a/images/os/centos_77_1908_vbox.json b/images/os/centos_77_1908_vbox.json
index 9dfb81f..2b9da8d 100644
--- a/images/os/centos_77_1908_vbox.json
+++ b/images/os/centos_77_1908_vbox.json
@@ -14,7 +14,7 @@
       "iso_checksum_type": "md5",
       "boot_command": [
         "<tab><wait>",
-        " ks=http://{{ .HTTPIP }}:{{ .HTTPPort }}/ks_centos77.cfg<enter>"
+        " ks=http://{{ .HTTPIP }}:{{ .HTTPPort }}/kickstart_centos_77_1908.cfg<enter>"
       ],
       "boot_wait": "10s",
       "cpus": 2,
$ ls images/os
build                           centos_77_1908_vbox.json
centos_75_1804_hyperv.json      kickstart_centos_75_1804.cfg
centos_75_1804_vbox.json        kickstart_centos_77_1908.cfg
centos_75_1804_vmware.json
packer build output:
masala:ees-prvsnr (EOS-4482 *)$ packer build images/os/centos_77_1908_vbox.json
virtualbox-iso: output will be in this color.

==> virtualbox-iso: Retrieving Guest additions
==> virtualbox-iso: Trying /Applications/VirtualBox.app/Contents/MacOS/VBoxGuestAdditions.iso
==> virtualbox-iso: Trying /Applications/VirtualBox.app/Contents/MacOS/VBoxGuestAdditions.iso
==> virtualbox-iso: /Applications/VirtualBox.app/Contents/MacOS/VBoxGuestAdditions.iso => /Users/vvv/src/ees-prvsnr/packer_cache/7784a55a71d48a1e9b5c487431438fef0f19d87f.iso
==> virtualbox-iso: Retrieving ISO
==> virtualbox-iso: Trying file://C:/Users/Public/Projects/VM/CentOS-7-x86_64-Minimal-1908
==> virtualbox-iso: Trying file://C:/Users/Public/Projects/VM/CentOS-7-x86_64-Minimal-1908?checksum=md5%3A7002b56184180591a8fa08c2fe0c7338
==> virtualbox-iso: file://C:/Users/Public/Projects/VM/CentOS-7-x86_64-Minimal-1908?checksum=md5%3A7002b56184180591a8fa08c2fe0c7338 => /Users/vvv/src/ees-prvsnr/packer_cache/4efd6202b248f7601db3aefb3c1c512fb8b9cc05.iso
==> virtualbox-iso: Starting HTTP server on port 8695
==> virtualbox-iso: Creating virtual machine...
==> virtualbox-iso: Creating hard drive...
==> virtualbox-iso: Creating forwarded port mapping for communicator (SSH, WinRM, etc) (host port 3536)
==> virtualbox-iso: Starting the virtual machine...
==> virtualbox-iso: Waiting 10s for boot...
==> virtualbox-iso: Typing the boot command...
==> virtualbox-iso: Using ssh communicator to connect: 127.0.0.1
==> virtualbox-iso: Waiting for SSH to become available...

Screenshot: packer-cannot-fetch-kickstart

setup-provisioner fails to start salt-master and salt-minion services

After locally executing setup-provisioner for a single-node configuration, it is observed that the following services fail to start:

  • salt-master
  • salt-minion

As a result, key signing also fails to complete on the master (primary) node.

Fix auto-update of:

  • /etc/salt/master
  • /etc/salt/minion
  • /etc/salt/minion_id
  • /opt/seagate/eos-prvsnr/pillar/components/cluster.sls: copy from the single or dual template based on the selected cluster type (see the sketch below).
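
A sketch of the last item, assuming single- and dual-node templates live next to cluster.sls (the template file names are hypothetical):

# pick the pillar template based on the selected cluster type
case "$cluster_type" in
    single) template=cluster.sls.single ;;
    ees)    template=cluster.sls.dual ;;
esac
cp "/opt/seagate/eos-prvsnr/pillar/components/$template" /opt/seagate/eos-prvsnr/pillar/components/cluster.sls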

EOS common system checks

Automate common system checks for EOS before provisioning starts and include them as part of components.system.sanity_check. One check is sketched after the list below.

The checks should be:

  • Set unique hostnames for eos nodes
  • Localhost fqdn entry in /etc/hosts
  • Check for the /etc/salt/minion_id entry
  • Cross-check grains.get fqdn against pillar.get fqdn
  • Check salt-key on the primary node: ensure all nodes are listed under Accepted Keys, for example:
[root@eosnode-1 ~]# salt-key -L
Accepted Keys:
eosnode-1
eosnode-2
Denied Keys:
Unaccepted Keys:
Rejected Keys:
[root@eosnode-1 ~]#
  • Ensure the hostname mentioned in the mini conf is in sync with the fqdn:
[root@eosnode-1 ~]# salt-call grains.get fqdn
local:
    eosnode-1.pun.seagate.com

[root@eosnode-1 ~]# salt-call grains.get id
local:
    eosnode-1
[root@eosnode-1 ~]# salt-call pillar.get cluster:<output from salt-call grains.get id>:fqdn
local:
    eosnode-1
  • Ensure that data interfaces have IPs allocated on all eos nodes

Example: run ip a s

 enp24s0f0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP group default qlen 1000
    link/ether ec:0d:9a:c1:51:34 brd ff:ff:ff:ff:ff:ff
    inet 192.168.60.6/24 brd 192.168.60.255 scope global noprefixroute enp24s0f0
       valid_lft forever preferred_lft forever
    inet6 fe80::a787:d53a:3c1:1518/64 scope link noprefixroute
       valid_lft forever preferred_lft forever
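
One of the checks, sketched as a state under components.system.sanity_check (the state id is hypothetical; the comparison mirrors the salt-call commands above):

check_fqdn_matches_pillar:
  cmd.run:
    - name: >
        test "$(salt-call --out newline_values_only grains.get fqdn)"
        == "$(salt-call --out newline_values_only pillar.get cluster:{{ grains['id'] }}:fqdn)"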

Install and set up a mock server for SSPL on VM

SSPL requires a mock server to be running when on a VM.
This is to replicate HW calls/events on a VM.

The task is to set up such a mock server for the VM environment.

PoC:
@malhar.vora

Issues faced while provisioning dual-node Hyper-V VM setup

Following are the issues observed and resolved while provisioning a dual-node Hyper-V VM setup:

  1. eosnode-2 did not have the data0 IP assigned on VM startup.
    • Assigned a static IP manually.
  2. eosnode-1 salt was not able to communicate with eosnode-2 salt.
    • Observed that if /etc/salt/pki/minion/minion_master.pub differs between the nodes, salt communication is broken.
    • Replaced /etc/salt/pki/minion/minion_master.pub on eosnode-2 with the one from eosnode-1 (see the commands after this list).
  3. Set the master to eosnode-1 in the /etc/salt/minion configuration file.
  4. Add eosnode-1 and eosnode-2 entries to the /etc/hosts files on both nodes.
  5. Check that cluster.sls has proper indentation (2 spaces deeper than the parent) for the 2 nodes before provisioning components.
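
The fix for item 2, expressed as commands run on eosnode-2 (paths are the ones from the list above):

scp eosnode-1:/etc/salt/pki/minion/minion_master.pub /etc/salt/pki/minion/minion_master.pub
systemctl restart salt-minion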

Problem: components.misc_pkgs.openldap is not idempotent

OpenLDAP installation fails if it is executed a second time after a first successful execution. The failure makes debugging the provisioner code that follows more complicated than it should be. An idempotency sketch follows the logs below.

cluster.sls:

cluster:
  type: single        # single/ees/ecs
  node_list:
    - eosnode-1
  eosnode-1:
    hostname: smc19-m10
    is_primary: true
    network:
      mgmt_if: eno1                   # Management network interfaces for bonding
      data_if: bond0                  # Management network interfaces for bonding
      gateway_ip:                     # No Implementation
    storage:
      metadata_device:                # Device for /var/mero and possibly SWAP
        - /dev/mapper/mpathn
      data_devices:                   # Data device/LUN from storage enclosure
        - /dev/mapper/mpatho
        - /dev/mapper/mpathb
        - /dev/mapper/mpathc
        - /dev/mapper/mpathd
        - /dev/mapper/mpathe
        - /dev/mapper/mpathf
        - /dev/mapper/mpathg
  storage_enclosure:
    id: storage_node_1            # equivalent to fqdn for server node
    type: 5U84                    # Type of enclosure. E.g. 5U84/PODS
    controller:
      type: gallium               # Type of controller on storage node. E.g. gallium/indium/sati
      primary_mc:
        ip: 127.0.0.1
        port: 8090
      secondary_mc:
        ip: 127.0.0.1
        port: 8090
      user: user
      password: 'passwd'

Versions:

> rpm -qi eos-prvsnr
Name        : eos-prvsnr
Version     : 1.0.0
Release     : 99_gitdcaf701_el7
Architecture: x86_64
Install Date: 2020-02-13T10:19:05 UTC
Group       : Tools
Size        : 524966
License     : Seagate
Signature   : (none)
Source RPM  : eos-prvsnr-1.0.0-99_gitdcaf701_el7.src.rpm
Build Date  : 2020-02-13T06:37:21 UTC
Build Host  : 4433bda62fc4
Relocations : (not relocatable)
URL         : http://gitlab.mero.colo.seagate.com/eos/provisioner/ees-prvsnr
Summary     : EOS Provisioning.
Description :
EOS Provisioning to deploy EOS Object storage software.
> rpm -qi eos-prvsnr-cli
Name        : eos-prvsnr-cli
Version     : 1.0.0
Release     : 99_gitdcaf701_el7
Architecture: x86_64
Install Date: 2020-02-13T10:16:33 UTC
Group       : Tools
Size        : 189956
License     : Seagate
Signature   : (none)
Source RPM  : eos-prvsnr-cli-1.0.0-99_gitdcaf701_el7.src.rpm
Build Date  : 2020-02-13T06:37:23 UTC
Build Host  : 4433bda62fc4
Relocations : (not relocatable)
URL         : http://gitlab.mero.colo.seagate.com/eos/provisioner/ees-prvsnr
Summary     : EOS Provisioner Command line interface.
Description :
EOS Provisioner Command line interface. Provides utilities to deploy EOS Object storage.

Logs for the first run

Logs from the second run
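
A common pattern for making such one-time setup steps idempotent is an unless guard on the state; a sketch, in which the command and marker file are hypothetical:

configure_openldap_once:
  cmd.run:
    - name: /opt/seagate/eos-prvsnr/utils/openldap_setup.sh    # hypothetical script
    - unless: test -f /etc/openldap/.eos_configured            # hypothetical marker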

Problem: eosnode-* VMs don't see each other

I have created eosnode-1 and eosnode-2 VMs, following this procedure.

salt-key sees only one node:

[vagrant@eosnode-1 ~]$ sudo salt-key --list-all
Accepted Keys:
Denied Keys:
Unaccepted Keys:
eosnode-1
Rejected Keys:

/etc/hosts file is identical on both nodes:

[vagrant@eosnode-1 ~]$ cat /etc/hosts
127.0.0.1   localhost localhost.localdomain localhost4 localhost4.localdomain4
::1         localhost localhost.localdomain localhost6 localhost6.localdomain6

# Cluster
#172.19.10.101 eosnode-1-data s3.seagate.com sts.seagate.com iam.seagate.com
[vagrant@eosnode-2 ~]$ cat /etc/hosts
127.0.0.1   localhost localhost.localdomain localhost4 localhost4.localdomain4
::1         localhost localhost.localdomain localhost6 localhost6.localdomain6

# Cluster
#172.19.10.101 eosnode-1-data s3.seagate.com sts.seagate.com iam.seagate.com

Solution-1: modify the Vagrantfile to fill /etc/hosts properly.
Solution-2: use the hostmanager vagrant plugin (see below).
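
Solution-2 as a command (the plugin is real; enabling it still requires a Vagrantfile change, e.g. config.hostmanager.enabled = true):

vagrant plugin install vagrant-hostmanager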

salt-minion 2019.2.1 crashes

salt-minion 2019.2.1 doesn't respond, loads the CPU heavily, and shows the errors below in journalctl.
2019.2.0 worked fine. It seems updates in the newer version introduced a regression which shows up with the provisioner pillars/formulas (I checked a simple master-minion configuration without any salt formulas/pillars, and it is not reproduced there). A workaround sketch follows the log below.

# systemctl status salt-minion
● salt-minion.service - The Salt Minion
   Loaded: loaded (/usr/lib/systemd/system/salt-minion.service; enabled; vendor preset: disabled)
   Active: active (running) since Fri 2019-09-27 03:56:59 IST; 2h 19min ago
     Docs: man:salt-minion(1)
           file:///usr/share/doc/salt/html/contents.html
           https://docs.saltstack.com/en/latest/contents.html
 Main PID: 3988 (salt-minion)
   CGroup: /system.slice/salt-minion.service
           ├─3988 /usr/bin/python3 -s /usr/bin/salt-minion
           ├─4007 /usr/bin/python3 -s /usr/bin/salt-minion
           └─4010 /usr/bin/python3 -s /usr/bin/salt-minion

Sep 27 06:16:14 eosnode-1 salt-minion[3988]: self.remove_periodic_callbback('ping', ping_master)
Sep 27 06:16:14 eosnode-1 salt-minion[3988]: AttributeError: 'Minion' object has no attribute 'remove_periodic_callbback'
Sep 27 06:16:15 eosnode-1 salt-minion[3988]: [ERROR   ] This Minion is already running. Not running Minion.tune_in()
Sep 27 06:16:15 eosnode-1 salt-minion[3988]: [CRITICAL] Unexpected error while connecting to localhost
Sep 27 06:16:15 eosnode-1 salt-minion[3988]: Traceback (most recent call last):
Sep 27 06:16:15 eosnode-1 salt-minion[3988]: File "/usr/lib/python3.6/site-packages/salt/minion.py", line 1027, in _connect_minion
Sep 27 06:16:15 eosnode-1 salt-minion[3988]: minion.tune_in(start=False)
Sep 27 06:16:15 eosnode-1 salt-minion[3988]: File "/usr/lib/python3.6/site-packages/salt/minion.py", line 2733, in tune_in
Sep 27 06:16:15 eosnode-1 salt-minion[3988]: self.remove_periodic_callbback('ping', ping_master)
Sep 27 06:16:15 eosnode-1 salt-minion[3988]: AttributeError: 'Minion' object has no attribute 'remove_periodic
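
Since 2019.2.0 is known to work, a possible workaround (sketch, assuming the older packages are still in the configured repos) is to downgrade salt until the regression is fixed:

yum downgrade -y salt-master-2019.2.0 salt-minion-2019.2.0
systemctl restart salt-master salt-minion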

Create a pool of generated config files

The config files generated during the provisioning of components are currently kept under the /tmp directory, which later gets deleted/overwritten during upgrade/teardown or as part of a housekeeping state of the components.
These files would be useful for maintaining/monitoring the state of the cluster/component after the cluster is provisioned.
The config files can be kept at: /opt/seagate/ees-prvsnr/generated_configs//
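
A sketch of preserving one such file via a housekeeping state, with hypothetical file names:

preserve_generated_config:
  file.copy:
    - name: /opt/seagate/ees-prvsnr/generated_configs/haproxy.cfg
    - source: /tmp/haproxy.cfg
    - makedirs: True
    - force: True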

Problem: HA component prepare and config states do not apply properly

‘prepare’ phase:

  • build-ees-ha-args.yaml does not have IP addresses set.
  • build-ees-ha-args.yaml has wrong node names.

‘config’ phase:

  • salt '*' state.apply components.ha.ees_ha.config runs the build-ees-ha script on both nodes.
    The script should only be executed on “eosnode-1” (see the sketch below).
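
A sketch of restricting the config phase to the primary node, using the is_primary flag already present in cluster.sls:

{% if pillar['cluster'][grains['id']]['is_primary'] %}
run_build_ees_ha:
  cmd.run:
    - name: build-ees-ha ...    # the actual invocation from the config state
{% endif %}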

deploy-eos fails to apply state `components.misc.build_ssl_cert_rpms` to primary node

The deploy-eos script applies the components.misc.build_ssl_cert_rpms state to both the primary and secondary nodes of the cluster, but conditional logic inside the formulas makes that state non-existent for the primary (http://gitlab.mero.colo.seagate.com/eos/provisioner/ees-prvsnr/blob/master/srv/components/misc/build_ssl_cert_rpms/init.sls#L3).

# salt eosnode-1 state.apply components.misc.build_ssl_cert_rpms test=True
eosnode-1:
    Data failed to compile:
----------
    No matching sls found for 'components.misc.build_ssl_cert_rpms' in env 'base'

Problem: no space left in /var/mero after teardown

Steps to reproduce:

  1. Create a dev VM (eosnode-1).
  2. Execute eesmoke script.
    The script is expected to succeed.
  3. Execute sudo salt \* state.apply components.teardown.
    hctl shutdown will return non-zero, but this shouldn't be a problem.
  4. Try to run eesmoke again. The script will fail this time (see error message below).
eesmoke fails on 2nd run.
$ time /tmp/eesmoke || echo FAIL $?
[...]
                  [eosnode-1] bootstrap-node: Unable to start m0d@0x7200000000000001:0x9 service
              stdout:
                  2020-01-22 04:16:15: Generating cluster configuration... Ok.
                  2020-01-22 04:16:16: Starting Consul server agent on this node......... Ok.
                  2020-01-22 04:16:23: Importing configuration into the KV Store... Ok.
                  2020-01-22 04:16:23: Starting Consul agents on remaining cluster nodes... Ok.
                  2020-01-22 04:16:23: Update Consul agents configs from the KV Store... Ok.
                  2020-01-22 04:16:24: Install Mero configuration files... Ok.
                  2020-01-22 04:16:24: Waiting for the RC Leader to get elected.... Ok.
                  2020-01-22 04:16:25: Starting Mero (phase1, mkfs)... Ok.
                  2020-01-22 04:16:31: Starting Mero (phase1, m0d)...
----------
          ID: Stage - Test Hare
    Function: cmd.run
        Name: __slot__:salt:setup_conf.conf_cmd('/opt/seagate/eos/hare/conf/setup.yaml', 'hare:config')
      Result: True
     Comment: Command "" run
     Started: 04:18:37.347898
    Duration: 10.525 ms
     Changes:
              ----------
              pid:
                  14898
              retcode:
                  0
              stderr:
              stdout:

Summary for eosnode-1
------------
Succeeded: 6 (changed=6)
Failed:    1
------------
Total states run:     7
Total run time: 242.056 s
ERROR: Minions returned with non-zero exit code

real    12m25.217s
user    0m11.282s
sys     0m1.517s
FAIL 1

journalctl shows that there is no space left on device:

Jan 22 04:16:32 eosnode-1 mero-server[9439]: + exec /usr/bin/m0d -e lnet:172.19.10.101@tcp:12345:2:1 -f '<0x7200000000000001:0x9>' -T linux -S stobs -D db -A linuxstob:addb-stobs -m 65536 -q 16 -c /etc/mero/confd.xc -H 172.19.10.101@tcp:12345:1:1 -U
Jan 22 04:16:32 eosnode-1 mero-server[9439]: m0d: fallocate("m0trace.9439", 67112960): No space left on device
Jan 22 04:16:32 eosnode-1 mero-server[9439]: subsystem trace init failed: rc = -28
Jan 22 04:16:32 eosnode-1 mero-server[9439]: m0d:
Jan 22 04:16:32 eosnode-1 mero-server[9439]: Failed to initialise Mero

Indeed:

$ df -h /var/mero
Filesystem      Size  Used Avail Use% Mounted on
/dev/sdb2       991M  975M     0 100% /var/mero

Problem: How to specify parameters with arguments in setup.yaml?

build-ees-ha --ip1 172.19.10.103 --ip2 172.19.10.104 /var/lib/hare/cluster.yaml -i data0 --left-node eosnode-1 --right-node eosnode-2
The build-ees-ha script takes parameters with values, as shown above. The present setup.yaml supports specifying arguments for any of the sections with scripts, but does not describe how such parameter values can be specified, as in the build-ees-ha invocation above.
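
One hypothetical way the schema could express this, passing the flags as an argument list (a proposal sketch, not the current setup.yaml format):

hare:
  config:
    cmd: build-ees-ha
    args: [--ip1, 172.19.10.103, --ip2, 172.19.10.104, /var/lib/hare/cluster.yaml, -i, data0, --left-node, eosnode-1, --right-node, eosnode-2]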

Problem: Stage - Initialize Hare fails on ssh setup issue

Root cause: the script expects that something sets up passwordless root login to the same node. The stage fails with a "Host key verification failed." message due to an empty /root/.ssh/known_hosts file.

cluster.sls:

cluster:
  type: single        # single/ees/ecs
  node_list:
    - eosnode-1
  eosnode-1:
    hostname: smc19-m10
    is_primary: true
    network:
      mgmt_if: eno1                   # Management network interfaces for bonding
      data_if: bond0                  # Management network interfaces for bonding
      gateway_ip:                     # No Implementation
    storage:
      metadata_device:                # Device for /var/mero and possibly SWAP
        - /dev/mapper/mpathn
      data_devices:                   # Data device/LUN from storage enclosure
        - /dev/mapper/mpatho
        - /dev/mapper/mpathb
        - /dev/mapper/mpathc
        - /dev/mapper/mpathd
        - /dev/mapper/mpathe
        - /dev/mapper/mpathf
        - /dev/mapper/mpathg
  storage_enclosure:
    id: storage_node_1            # equivalent to fqdn for server node
    type: 5U84                    # Type of enclosure. E.g. 5U84/PODS
    controller:
      type: gallium               # Type of controller on storage node. E.g. gallium/indium/sati
      primary_mc:
        ip: 127.0.0.1
        port: 8090
      secondary_mc:
        ip: 127.0.0.1
        port: 8090
      user: user
      password: 'passwd'
520406@smc19-m10:~> facter --json hostname fqdn processorcount memorysize_mb ipaddress_bond0
{
  "hostname": "smc19-m10",
  "fqdn": "smc19-m10.mero.colo.seagate.com",
  "processorcount": 48,
  "memorysize_mb": "191890.84",
  "ipaddress_bond0": "172.16.0.105"
}
[root@smc19-m10 .ssh]# stat /root/.ssh/known_hosts
  File: ‘/root/.ssh/known_hosts’
  Size: 0         	Blocks: 0          IO Block: 4096   regular empty file
Device: fd00h/64768d	Inode: 269550664   Links: 1
Access: (0644/-rw-r--r--)  Uid: (    0/    root)   Gid: (    0/    root)
Access: 2020-02-13 12:56:50.020410854 +0000
Modify: 2020-02-13 06:37:15.000000000 +0000
Change: 2020-02-13 10:16:33.283170230 +0000
 Birth: -
[root@smc19-m10 .ssh]# head -v -n 100 /root/.ssh/*
==> /root/.ssh/authorized_keys <==
ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABgQCg6WBK+qXSDIunMgCpxUgcTfMnSO/WLBi6UikIfvbCzcf0+8JkpT/4zbwX3tnT+6ayb6eY0Qo6MlP5qd2OQb4MxPRVDHW8F4o/WAV41+CLdojGPGFSuWyCUOg6GfCPz7n1nKJmuTQ6DpYddQR9s1kIPVjRXDXiAJ8a8FkhJqKOVYDXNMEJ01YNF9fLepWFj8aScw0HalvvvXVq4RNynEDOmqGmLOLzBDVw8XpbDPUNVUxMyUf97ObXRL9KgQIToNoTYte5liQ9jyk7qupRxzZQr3z/7wC80SUkbKvBtC4S0FmWIhvROwOwZVhcMFUT30KavACW0fRd9FOUOP+wxQVp1v5fQBiOoCnoYZbKRNZN/rQmcecIhdFZfTZnq2HFF0kpu17hVa2cMqSbvRstxVot+FpxAH9HaVUMDuOngRpxzLSCXzp8sxSRGAwG2dEKnL/zTk6NPv4ED3645X4zGRBfPpO7eFcd/qoRw5FKwyFZa3/zzE2hre12I+H0+LHCoTU= 714502@PUN-U714502L001

==> /root/.ssh/config <==
Host eosnode-1
    HostName eosnode-1
    User root
    UserKnownHostsFile /dev/null
    StrictHostKeyChecking no
    IdentityFile /root/.ssh/id_rsa_prvsnr
    IdentitiesOnly yes

Host eosnode-2
    HostName eosnode-2
    User root
    UserKnownHostsFile /dev/null
    StrictHostKeyChecking no
    IdentityFile /root/.ssh/id_rsa_prvsnr
    IdentitiesOnly yes

==> /root/.ssh/id_rsa_prvsnr <==
-----BEGIN OPENSSH PRIVATE KEY-----
b3BlbnNzaC1rZXktdjEAAAAABG5vbmUAAAAEbm9uZQAAAAAAAAABAAABlwAAAAdzc2gtcn
...
20+x/gO5FXISdjAAAAFjcxNDUwMkBQVU4tVTcxNDUwMkwwMDEBAgME
-----END OPENSSH PRIVATE KEY-----

==> /root/.ssh/id_rsa_prvsnr.pub <==
ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABgQCg6WBK+qXSDIunMgCpxUgcTfMnSO/WLBi6UikIfvbCzcf0+8JkpT/4zbwX3tnT+6ayb6eY0Qo6MlP5qd2OQb4MxPRVDHW8F4o/WAV41+CLdojGPGFSuWyCUOg6GfCPz7n1nKJmuTQ6DpYddQR9s1kIPVjRXDXiAJ8a8FkhJqKOVYDXNMEJ01YNF9fLepWFj8aScw0HalvvvXVq4RNynEDOmqGmLOLzBDVw8XpbDPUNVUxMyUf97ObXRL9KgQIToNoTYte5liQ9jyk7qupRxzZQr3z/7wC80SUkbKvBtC4S0FmWIhvROwOwZVhcMFUT30KavACW0fRd9FOUOP+wxQVp1v5fQBiOoCnoYZbKRNZN/rQmcecIhdFZfTZnq2HFF0kpu17hVa2cMqSbvRstxVot+FpxAH9HaVUMDuOngRpxzLSCXzp8sxSRGAwG2dEKnL/zTk6NPv4ED3645X4zGRBfPpO7eFcd/qoRw5FKwyFZa3/zzE2hre12I+H0+LHCoTU= 714502@PUN-U714502L001

==> /root/.ssh/known_hosts <==
[root@smc19-m10 .ssh]#
$ sudo salt eosnode-1 state.apply components.hare
...
<successful stages>
...
----------
          ID: Stage - Initialize Hare
    Function: cmd.run
        Name: __slot__:salt:setup_conf.conf_cmd('/opt/seagate/eos/hare/conf/setup.yaml', 'hare:init')
      Result: False
     Comment: Command "/opt/seagate/eos/hare/libexec/prov-init /var/lib/hare/cluster.yaml" run
     Started: 13:07:53.058719
    Duration: 367.186 ms
     Changes:
              ----------
              pid:
                  54630
              retcode:
                  1
              stderr:

                  Warning: Could not locate a cache base directory from the environment.

                  You can provide a cache base directory by pointing the $XDG_CACHE_HOME
                  environment variable to a directory with read and write permissions.

                  Host key verification failed.
                  Traceback (most recent call last):
                    File "/opt/seagate/eos/hare/bin/../bin/cfgen", line 1397, in <module>
                      main()
                    File "/opt/seagate/eos/hare/bin/../bin/cfgen", line 121, in main
                      enrich_cluster_desc(cluster_desc, opts.mock)
                    File "/opt/seagate/eos/hare/bin/../bin/cfgen", line 215, in enrich_cluster_desc
                      ipaddr_key(node['data_iface']))
                    File "/opt/seagate/eos/hare/bin/../bin/cfgen", line 294, in get_facts
                      return json.loads(run_command(hostname, 'facter', '--json', *args))
                    File "/opt/seagate/eos/hare/bin/../bin/cfgen", line 288, in run_command
                      timeout=15).decode()
                    File "/usr/lib64/python3.6/subprocess.py", line 356, in check_output
                      **kwargs).stdout
                    File "/usr/lib64/python3.6/subprocess.py", line 438, in run
                      output=stdout, stderr=stderr)
                  subprocess.CalledProcessError: Command '['ssh', 'smc19-m10', 'facter', '--json', 'hostname', 'fqdn', 'processorcount', 'memorysize_mb', 'ipaddress_bond0']' returned non-zero exit status 255.
              stdout:
                  2020-02-13 13:07:53: Generating cluster configuration...
----------

Full log of the above
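
Note that the /root/.ssh/config stanzas above disable host key checking only for the eosnode-1/eosnode-2 aliases, while cfgen runs ssh smc19-m10, which none of them covers. A possible fix (sketch) is to pre-populate known_hosts for the real hostname:

ssh-keyscan -H smc19-m10 smc19-m10.mero.colo.seagate.com >> /root/.ssh/known_hosts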

Problem: deploy-eos fails due to missing elasticsearch

Re-provisioning the system with these commands

salt \* state.apply components.teardown
vim /opt/seagate/eos-prvsnr/pillar/components/cluster.sls  # edit manually
/opt/seagate/eos-prvsnr/cli/deploy-eos

fails at "misc_pkgs.elasticsearch" stage of deploy-eos.

Error message:

----------
          ID: Start elasticsearch
    Function: service.running
        Name: elasticsearch
      Result: False
     Comment: The named service elasticsearch is not available
     Started: 13:17:55.373876
    Duration: 18.687 ms
     Changes:

Summary for eosnode-1
------------
Succeeded: 4 (changed=2)
Failed:    1
------------
Total states run:     5

Environment: smc7-m11/smc8-m11 cluster (actual hardware)
