comcast / bynar

Server remediation as a service

License: Apache License 2.0

Rust 96.18% Shell 0.70% PLpgSQL 3.12%

ceph protobuf rust zeromq

bynar's People

Contributors

pkanna000 tyagian sysbot dalavancloud lcao0319 mzhong1 vinutha17 jordanfeldman iamjarvo mbhask000 kaivnit stdigitaldatacenter ghas-results

bynar's Issues

Improve Error Handling

The error handling in bynar could be improved. A lot of the errors are converted to String which sometimes loses information about what the actual error is. There are a few options that could improve the situation:

https://rust-lang-nursery.github.io/error-chain/error_chain/index.html
https://boats.gitlab.io/failure/fail.html (which I've heard might be better)
https://github.com/rushmorem/derive-error (which looks to be about what I'd do manually). This could be the easiest option if it works.
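For comparison, here is a minimal sketch of the manual approach using only the standard library, i.e. roughly what a derive crate would generate. The `BynarError` type, its variants, and the `parse_device_id` helper are hypothetical names chosen for illustration and are not taken from the bynar source.

```rust
use std::error::Error;
use std::fmt;
use std::io;
use std::num::ParseIntError;

/// Hypothetical error type: each variant keeps the original error
/// instead of flattening it into a String.
#[derive(Debug)]
pub enum BynarError {
    Io(io::Error),
    Parse(ParseIntError),
}

impl fmt::Display for BynarError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            BynarError::Io(e) => write!(f, "I/O error: {}", e),
            BynarError::Parse(e) => write!(f, "parse error: {}", e),
        }
    }
}

impl Error for BynarError {
    /// Expose the wrapped error so callers can walk the cause chain.
    fn source(&self) -> Option<&(dyn Error + 'static)> {
        match self {
            BynarError::Io(e) => Some(e),
            BynarError::Parse(e) => Some(e),
        }
    }
}

// The From impls let `?` convert the underlying errors automatically.
impl From<io::Error> for BynarError {
    fn from(e: io::Error) -> Self {
        BynarError::Io(e)
    }
}

impl From<ParseIntError> for BynarError {
    fn from(e: ParseIntError) -> Self {
        BynarError::Parse(e)
    }
}

/// Hypothetical helper showing `?` with the typed error.
pub fn parse_device_id(s: &str) -> Result<u32, BynarError> {
    Ok(s.trim().parse::<u32>()?)
}
```

With this shape a caller can still match on the variant or call `source()` to reach the original error, which is the information that is lost when everything becomes a String.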

hdd library

I was checking this out today: https://docs.rs/hdd/0.10.0/hdd/ and it might actually be better at identifying disks than udev. I've seen Linux get into situations where udev drops the device from the tree but it's still mounted in the filesystem tree. I wonder if this would help with that problem.

What if a system fails with a panic, or is dead or partially dead?

Since Bynar runs as a binary (agent) on the system, what happens in the following scenarios?

Kernel panic
System rebooted but did not come back up?
Someone stopped the agent and did not restart it?
Partially died due to hardware (memory, CPU, RAID...)

When the system goes down, the agent goes down with it, because the agent runs on the very system that must be healthy in order to do the monitoring.

Possible Solution:

Client/server architecture?
Peer-to-peer monitoring (e.g. Ceph OSDs)?

Possible issues with these solutions:

Client/server architecture needs administrative overhead: failover, firewall, DR, certs, LB and redundancy...
Peer to peer: message broadcasting, or a streamlined/narrowed-down approach. For example, should a failed system be monitored only by its neighbors, i.e. the systems before and after it in the sequence?

Just jotting down my thoughts so they don't get lost. :)

Virtual Machine Test Env

Testing bynar is challenging to say the least. If possible I wonder if a virtual machine could be created that allows integration testing to be performed over and over to validate logic correctness.

Add statistics about ongoing operations to timescaledb

It's hard to know what bynar is doing without some kind of statistics being saved somewhere.

documentation changes

These need to be added:
liblvm2-dev
gcc

State machine

test_disk.rs needs a state machine and src/main.rs also needs a state_machine
Hoverbear has great docs on this: https://hoverbear.org/2016/10/12/rust-state-machine-pattern/
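A minimal sketch of the pattern from that post, where each state is its own type and a transition consumes the old value. The `Disk`, `Unscanned`, `Good`, and `Failed` types and the `evaluate` method are hypothetical and only illustrate the idea; they are not bynar's actual state machine.

```rust
/// Hypothetical states; each state is a distinct type.
pub struct Unscanned;
pub struct Good;
pub struct Failed {
    pub reason: String,
}

/// A disk tagged with its current state at the type level.
pub struct Disk<S> {
    pub path: String,
    pub state: S,
}

impl Disk<Unscanned> {
    pub fn new(path: &str) -> Self {
        Disk {
            path: path.to_string(),
            state: Unscanned,
        }
    }

    /// Consumes the unscanned disk. The outcome of the health check
    /// (supplied by the caller here) decides which state comes next,
    /// so an unscanned disk can never be used after evaluation.
    pub fn evaluate(self, check: Result<(), String>) -> Result<Disk<Good>, Disk<Failed>> {
        match check {
            Ok(()) => Ok(Disk {
                path: self.path,
                state: Good,
            }),
            Err(reason) => Err(Disk {
                path: self.path,
                state: Failed { reason },
            }),
        }
    }
}

impl Disk<Failed> {
    /// Only available on failed disks; calling it on any other
    /// state is a compile error.
    pub fn describe(&self) -> String {
        format!("{} failed: {}", self.path, self.state.reason)
    }
}
```

The benefit is that invalid transitions are rejected by the compiler rather than discovered at runtime.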

Setup log rotation for the bynar logs

We don't want the logs to grow forever and fill /var/log.

Slowly add new disks

Some ceph clusters can't handle having disks added quickly. This task would see them sharded in slowly.
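One way this could look is to split the pending disks into small batches and let the caller wait between batches, for example until the cluster reports healthy again. The `plan_batches` function is a hypothetical helper, and how to wait between batches is left as an assumption outside the sketch.

```rust
/// Hypothetical helper: split pending disks into batches of at most
/// `batch_size`. A batch size of 0 is treated as 1 so that
/// `chunks` never panics.
pub fn plan_batches(disks: &[&str], batch_size: usize) -> Vec<Vec<String>> {
    disks
        .chunks(batch_size.max(1))
        .map(|chunk| chunk.iter().map(|d| d.to_string()).collect())
        .collect()
}
```

The caller would add one batch, wait for the cluster to settle, and then move on to the next.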

make Bynar peer to peer

Bynar is currently standalone. Each system runs Bynar as a service and communicates with a database to log its operations. An enhancement is to make it a peer-to-peer application. This would make it possible to heal systems that are otherwise unreachable due to network card failures, or to reduce the workload on a busy system. Every system where Bynar is installed should be able to communicate with others in the peer network. The communication could potentially be limited to only systems within a particular network zone or to systems within a storage cluster. Alternatively, it could be limited to systems that are using the same underlying storage technology. There is a Rust library that can be explored to accomplish this.

Maintenance Mode

Anything better we can come up with besides just disabling the cron job while maintenance is being performed on a server?

log state machine failures to database

main.rs: line 261 - when state machine block device state is Failed, log that information to the database by calling save_state().

Add journal to bluestore osds

Right now the code doesn't make use of the journal information passed to add_disk.

Remove mktemp because of MPL-2 license

Document the protocol buffer api

How do clients consume the API?

Different permission levels

It would be nice to restrict what commands can be run on the CLI. This is partly taken care of by Linux permissions. Maybe that's good enough, not sure yet.

Document code

The code needs a lot more code comments to help newcomers.

Fault Injection

Could libfiu be used to test bynar? https://blitiri.com.ar/p/libfiu/

Systemd files

Add systemd files. Use example from Ceph-usage

clean up links to ceph_dead_disk

Make the naming of everything consistent

Bug: OperationInfo::new() created with device_id set to 0

main.rs::287 -- An OperationInfo struct is created with device_id set to 0. This will then create an operation entry that doesn't correspond to any device. The entry is essentially a zombie: it will not be found when looking for outstanding operations, since get_outstanding_tickets() joins the operations table with the devices table.
Also change the struct to accept only valid input upon instantiation.
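A sketch of what "accept only valid input" could look like, using `NonZeroU32` so that a device_id of 0 cannot be stored at all. The fields and constructor signature below are hypothetical; the real `OperationInfo` in bynar has its own fields, which are not shown in this issue.

```rust
use std::num::NonZeroU32;

/// Hypothetical stand-in for OperationInfo with only two fields.
#[derive(Debug)]
pub struct OperationInfo {
    device_id: NonZeroU32,
    description: String,
}

impl OperationInfo {
    /// Rejects device_id 0 up front, so a zombie operation entry
    /// can never be constructed.
    pub fn new(device_id: u32, description: &str) -> Result<Self, String> {
        let device_id = NonZeroU32::new(device_id)
            .ok_or_else(|| "device_id must refer to a real device, got 0".to_string())?;
        Ok(OperationInfo {
            device_id,
            description: description.to_string(),
        })
    }

    pub fn device_id(&self) -> u32 {
        self.device_id.get()
    }

    pub fn description(&self) -> &str {
        &self.description
    }
}
```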

Needs a contributing guide

Hi Chris! Could you please include a contributing guide so people know how to get involved?
Thanks so much.

Remove entire server from rotation

High level steps for removing an entire server from the cluster.

fine tune tracking of individual operations

add_or_update_operation() needs to be called to a) evaluate and set the completed/done time and b) track other operations like disk add, disk remove, etc. for each piece of hardware that is remediated.

Update jira tickets

It would be great if it updated tickets to talk about what actions it's taken.

Glusterfs Replace brick

https://gluster.readthedocs.io/en/latest/Administrator%20Guide/Managing%20Volumes/#replace-brick
Have bynar take these steps to swap a disk for glusterfs. The volume type will need to be ascertained before following the appropriate set of steps.

Bynar doesn't understand root disks

In testing with a fresh Ubuntu 18.04 I noticed that bynar tried to mark my root disk for replacement. It didn't bother to check that the disk had mounted partitions. This needs fixing.
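A possible check, sketched under the assumption that the contents of /proc/mounts are read by the caller and passed in as text. The function names are hypothetical, and the sketch only handles plain partitions such as /dev/sda1 or /dev/nvme0n1p1; it does not follow LVM or device-mapper volumes back to their physical disk.

```rust
/// True if `device` is `disk` itself or one of its partitions.
fn is_on_disk(device: &str, disk: &str) -> bool {
    let rest = match device.strip_prefix(disk) {
        Some(r) => r,
        None => return false,
    };
    if rest.is_empty() {
        return true;
    }
    // Disks whose name ends in a digit (nvme0n1, mmcblk0) put a 'p'
    // between the disk name and the partition number.
    let digits = if disk.ends_with(|c: char| c.is_ascii_digit()) {
        match rest.strip_prefix('p') {
            Some(d) => d,
            None => return false,
        }
    } else {
        rest
    };
    !digits.is_empty() && digits.chars().all(|c| c.is_ascii_digit())
}

/// Returns the mount points backed by `disk`, given text in
/// /proc/mounts format (device, mount point, fs type, ...).
pub fn mount_points_on_disk(mounts: &str, disk: &str) -> Vec<String> {
    mounts
        .lines()
        .filter_map(|line| {
            let mut fields = line.split_whitespace();
            let device = fields.next()?;
            let mount_point = fields.next()?;
            if is_on_disk(device, disk) {
                Some(mount_point.to_string())
            } else {
                None
            }
        })
        .collect()
}
```

A disk for which this returns a non-empty list, in particular one containing `/`, should not be marked for replacement automatically.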

New database column to show how long running

It would be nice if there was a database column to show how long a scan took which could help identify machines where bynar is crashing or taking forever.
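Measuring the duration itself is straightforward with `std::time::Instant`. The `timed` wrapper below is a hypothetical helper; writing the resulting value to the database is not shown.

```rust
use std::time::{Duration, Instant};

/// Runs `work` and returns its result together with how long it took.
pub fn timed<T>(work: impl FnOnce() -> T) -> (T, Duration) {
    let start = Instant::now();
    let out = work();
    (out, start.elapsed())
}
```

The elapsed `Duration` could then be stored next to the scan record, for example as milliseconds via `as_millis()`.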

Kick slow disks

Track the latency of disks and kick slow ones out of the cluster. Disks that are slow but not dead can kill a cluster. Benchmark the disks and keep a log somewhere. If a disk gets bad results 5 times in a row, kick it out of the cluster.
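The "5 times in a row" rule can be sketched as a small per-disk counter that resets on any good result. The `SlowDiskTracker` type and its latency threshold are hypothetical; how the benchmark latency is obtained is outside the sketch.

```rust
use std::time::Duration;

/// Hypothetical per-disk tracker for consecutive slow benchmarks.
pub struct SlowDiskTracker {
    threshold: Duration,
    max_strikes: u32,
    strikes: u32,
}

impl SlowDiskTracker {
    pub fn new(threshold: Duration, max_strikes: u32) -> Self {
        SlowDiskTracker {
            threshold,
            max_strikes,
            strikes: 0,
        }
    }

    /// Records one benchmark result. Returns true once the disk has
    /// been slow `max_strikes` times in a row; a single good result
    /// resets the count.
    pub fn record(&mut self, latency: Duration) -> bool {
        if latency > self.threshold {
            self.strikes += 1;
        } else {
            self.strikes = 0;
        }
        self.strikes >= self.max_strikes
    }
}
```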

Idempotence for notifications

The provided platforms (Slack and JIRA) can be used if the owner meets their requirements. Would it be possible to develop adapters for notification services, so that a user can either develop their own or send a constructed .json to any endpoint?

This would prove valuable for the different types of gear that might lie underneath. Construct the errors and the process, and hand that off to the adapters or modules for the vendor. Specify those in the configuration, with a set of static configurations to be passed. Pointers or objects can be used, or simply the functions thereof.

The idea here is that no matter the call or notification, we receive the same message each time, which can be passed to the necessary next stage. It should also help if you have a noisy source or something that sends multiple messages. It's inherent in the definition, but the idea is: if we receive the alert multiple times, don't create multiple tickets.
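The "don't create multiple tickets" part can be sketched as deduplication on a stable alert key. The `Alert` fields and the `TicketDeduper` type are hypothetical, and the seen set is kept in memory here; a real implementation would presumably persist it so it survives restarts.

```rust
use std::collections::HashSet;

/// Hypothetical alert identity: two alerts with the same host,
/// device, and kind are treated as the same event.
#[derive(Clone, Debug, PartialEq, Eq, Hash)]
pub struct Alert {
    pub host: String,
    pub device: String,
    pub kind: String,
}

/// Remembers which alerts already produced a ticket.
pub struct TicketDeduper {
    seen: HashSet<Alert>,
}

impl TicketDeduper {
    pub fn new() -> Self {
        TicketDeduper {
            seen: HashSet::new(),
        }
    }

    /// Returns true only the first time a given alert is seen;
    /// repeats return false so no second ticket is created.
    pub fn should_create_ticket(&mut self, alert: &Alert) -> bool {
        self.seen.insert(alert.clone())
    }
}
```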

HP integration

Write some protobuf APIs for HP ticket integration

Auto remediate bonding issue

We often see an interface go down in a bonding configuration, which can simply be brought back up with ifdown/ifup about 60 to 70% of the time.

Validate the OS
Check bonding: cat /proc/net/bonding/bond1 (active/passive, based on the type of bonding configuration)
ifdown
ifup
Validate.

If not up,

file a ticket to do the following:

Check the link on the switch port; if not green, engage the network team. If that does not fix it,
clean or reseat the SFP; if that does not fix it,
change the SFP; if that does not fix it,
change the fiber; if that does not fix it,
raise a vendor case.
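The "check bonding" step could be sketched as a parser over the text of /proc/net/bonding/bond1 that lists the slave interfaces whose MII status is not up. The `down_slaves` function is a hypothetical helper; it assumes the usual layout where each "Slave Interface:" line is followed by that slave's "MII Status:" line, and it leaves reading the file and running ifdown/ifup to the caller.

```rust
/// Returns the names of slave interfaces whose MII status is not "up".
pub fn down_slaves(bonding: &str) -> Vec<String> {
    let mut current: Option<&str> = None;
    let mut down = Vec::new();
    for line in bonding.lines() {
        let line = line.trim();
        if let Some(name) = line.strip_prefix("Slave Interface:") {
            current = Some(name.trim());
        } else if let Some(status) = line.strip_prefix("MII Status:") {
            // The bond-level MII status appears before any slave
            // section, so it is skipped while `current` is None.
            if let Some(name) = current {
                if status.trim() != "up" {
                    down.push(name.to_string());
                }
            }
        }
    }
    down
}
```

An empty result after the ifdown/ifup cycle would correspond to the final "validate" step; a non-empty one would trigger the ticket.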

Expose the database over the api

Exposing the database over the API will allow others to integrate more easily.

Redfish integration

A Rust Redfish library doesn't exist yet, but having one would allow power cycling servers without ssh access. It would also allow BIOS remediation and a host of other things. The library could probably be generated from schema files: https://redfish.dmtf.org/index.php?q=redfish/schema_index

Flexible workflow specification

While it's great to have everything written as Rust code, it might also be nice for end customers to be able to create workflow pipelines without writing Rust. This would allow customers that don't have programming knowledge to automate their drive change process.

Introduce authentication to the disk-manager to prevent unauthorized use

Inspect the database for missing disks

Linux has an issue where if a disk fails it often gets removed from the file tree. That makes it impossible to know whether the disk doesn't exist or has just failed, unless we previously recorded it somewhere. The goal of this task is to modify the check_all_disks fn in the test_disk.rs file and have it communicate with the database to check if there are any missing disks. If they are missing, can we assume they have failed and need intervention?
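At its core the comparison is a set difference between what the database recorded earlier and what the host reports now. The `missing_disks` function below is a hypothetical helper that takes both lists as plain strings; which identifier to compare on (device path, serial number, or something else) and how the lists are fetched are assumptions left open here.

```rust
use std::collections::BTreeSet;

/// Returns the recorded disks that are no longer present on the host,
/// in the order they were recorded.
pub fn missing_disks(recorded: &[&str], present: &[&str]) -> Vec<String> {
    let present: BTreeSet<&str> = present.iter().copied().collect();
    recorded
        .iter()
        .copied()
        .filter(|disk| !present.contains(disk))
        .map(str::to_string)
        .collect()
}
```

Whether a missing disk should be treated as failed is still the open question above; this only produces the candidate list.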

LVM segfault issue

With enough threads it seems that LVM init segfaults for some unknown reason.
```
00:16:10 [DEBUG] postgres: preparing query with name ``: UPDATE hardware SET state = 'unscanned' WHERE device_id=2
00:16:10 [DEBUG] postgres: executing query: COMMIT
00:16:10 [DEBUG] bynar::test_disk: thread 2993564 Transition failed. Trying next transition
00:16:10 [DEBUG] lvm: dropping lvm

Thread 28 "bynar" received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7fffe33f9700 (LWP 2993637)]
0x00007ffff7683607 in dm_pool_alloc_aligned () from /lib/x86_64-linux-gnu/libdevmapper.so.1.02.1
(gdb) bt
#0 0x00007ffff7683607 in dm_pool_alloc_aligned () from /lib/x86_64-linux-gnu/libdevmapper.so.1.02.1
#1 0x00007ffff7683a5e in dm_pool_zalloc () from /lib/x86_64-linux-gnu/libdevmapper.so.1.02.1
#2 0x00007ffff78cef48 in ?? () from /lib/x86_64-linux-gnu/liblvm2app.so.2.2
#3 0x00007ffff78c3cae in ?? () from /lib/x86_64-linux-gnu/liblvm2app.so.2.2
#4 0x00007ffff78c755e in ?? () from /lib/x86_64-linux-gnu/liblvm2app.so.2.2
#5 0x00007ffff78b46a0 in lvm_init () from /lib/x86_64-linux-gnu/liblvm2app.so.2.2
#6 0x0000555556284acc in lvm::Lvm::new::h70e76aef47c8d063 (system_dir=Option<&str>) at lvm/src/lib.rs:450
#7 0x0000555555c2a5b0 in bynar::test_disk::is_disk_blank::h6e8f871b18876247 (dev=0x7fff74001320) at src/test_disk.rs:1586
#8 0x0000555555c19802 in _$LT$bynar..test_disk..Eval$u20$as$u20$bynar..test_disk..Transition$GT$::transition::h985d5930d2e53f8b (
to_state=Good, device=0x7fffe33f2d48, _scsi_info=0x7fffe33f2ed8, _simulate=false) at src/test_disk.rs:449