comcast / bynar
Server remediation as a service
License: Apache License 2.0
The error handling in bynar could be improved. A lot of the errors are converted to String, which sometimes loses information about what the actual error is. There are a few options that could improve the situation:
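One possible direction, sketched here as an assumption rather than as one of the options the issue originally listed, is a dedicated error enum that wraps the underlying error instead of flattening it to a String. `RemediationError`, its variants, and `parse_device_id` are hypothetical names for illustration and are not taken from bynar's code.

```rust
use std::error::Error;
use std::fmt;
use std::io;
use std::num::ParseIntError;

/// Hypothetical error type; variant names are illustrative, not bynar's.
#[derive(Debug)]
pub enum RemediationError {
    Io(io::Error),
    Parse(ParseIntError),
    Other(String),
}

impl fmt::Display for RemediationError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            RemediationError::Io(e) => write!(f, "I/O error: {}", e),
            RemediationError::Parse(e) => write!(f, "parse error: {}", e),
            RemediationError::Other(msg) => write!(f, "{}", msg),
        }
    }
}

impl Error for RemediationError {
    // Keep the original error reachable so callers can still inspect it.
    fn source(&self) -> Option<&(dyn Error + 'static)> {
        match self {
            RemediationError::Io(e) => Some(e),
            RemediationError::Parse(e) => Some(e),
            RemediationError::Other(_) => None,
        }
    }
}

impl From<io::Error> for RemediationError {
    fn from(e: io::Error) -> Self {
        RemediationError::Io(e)
    }
}

impl From<ParseIntError> for RemediationError {
    fn from(e: ParseIntError) -> Self {
        RemediationError::Parse(e)
    }
}

/// Hypothetical helper: `?` converts the parse error without turning it into a String.
pub fn parse_device_id(raw: &str) -> Result<u32, RemediationError> {
    let id = raw.trim().parse::<u32>()?;
    Ok(id)
}
```

With this shape a caller can match on the variant or walk `source()` to reach the original error, which a plain String would not allow.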
I was checking this out today: https://docs.rs/hdd/0.10.0/hdd/ and it might actually be better at identifying disks than udev. I've seen Linux get into situations where udev drops the device from the tree while it's still mounted in the filesystem tree. I wonder if this would help with that problem.
Since Bynar runs as a binary (agent) on the system, what happens in the following scenario?
When the system goes down, the agent goes down with it, because the agent runs on the very system that has to be healthy for the monitoring to execute.
Possible Solution:
Possible issue again on the solution:
Just throwing out my thoughts so they don't get missed. :)
Testing bynar is challenging to say the least. If possible I wonder if a virtual machine could be created that allows integration testing to be performed over and over to validate logic correctness.
It's hard to know what bynar is doing without some kind of statistics being saved somewhere.
These need to be added:
liblvm2-dev
gcc
test_disk.rs needs a state machine, and src/main.rs also needs a state machine.
Hoverbear has great docs on this: https://hoverbear.org/2016/10/12/rust-state-machine-pattern/
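As a rough illustration of what such a state machine could look like, here is a minimal enum-based sketch. It is a simpler table-driven variant, not the generic-type pattern from the Hoverbear post and not bynar's actual design. The state and event names are hypothetical, loosely borrowed from states mentioned elsewhere in these issues (unscanned, Good, Failed).

```rust
/// Hypothetical disk states; not bynar's real state set.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum DiskState {
    Unscanned,
    Scanned,
    Good,
    Failed,
    WaitingForReplacement,
    Replaced,
}

/// Hypothetical events that drive the machine.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum DiskEvent {
    ScanCompleted,
    HealthCheckPassed,
    HealthCheckFailed,
    TicketFiled,
    DiskSwapped,
}

/// Returns the next state, or None when the transition is not allowed.
pub fn next_state(state: DiskState, event: DiskEvent) -> Option<DiskState> {
    match (state, event) {
        (DiskState::Unscanned, DiskEvent::ScanCompleted) => Some(DiskState::Scanned),
        (DiskState::Scanned, DiskEvent::HealthCheckPassed) => Some(DiskState::Good),
        (DiskState::Scanned, DiskEvent::HealthCheckFailed) => Some(DiskState::Failed),
        (DiskState::Failed, DiskEvent::TicketFiled) => Some(DiskState::WaitingForReplacement),
        (DiskState::WaitingForReplacement, DiskEvent::DiskSwapped) => Some(DiskState::Replaced),
        // Every other combination is rejected explicitly.
        _ => None,
    }
}
```

Keeping all legal transitions in one `match` makes illegal ones easy to spot in review, at the cost of checking them at run time rather than compile time.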
We don't want the logs to grow forever and fill /var/log.
Some Ceph clusters can't handle having disks added quickly. This task would see them sharded in slowly.
Bynar is currently standalone. Each system runs Bynar as a service and communicates with a database to log its operations. An enhancement would be to make it a peer-to-peer application. This would make it possible to heal systems that are otherwise unreachable due to network card failures, or to reduce the workload on a busy system. Every system where Bynar is installed should be able to communicate with the others in the peer network. The communication could potentially be limited to systems within a particular network zone or to systems within a storage cluster. Alternatively, it could be limited to systems that use the same underlying storage technology. There is a Rust library that can be explored to accomplish this.
Is there anything better we can come up with than just disabling the cron job while maintenance is being performed on a server?
main.rs: line 261 - when the state machine's block device state is Failed, log that information to the database by calling save_state().
Right now the code doesn't make use of the journal information passed to add_disk.
How do clients consume the API?
It would be nice to restrict which commands can be run on the CLI. This is partly taken care of by Linux permissions. Maybe that's good enough, not sure yet.
The code needs a lot more code comments to help newcomers.
Could libfiu be used to test bynar? https://blitiri.com.ar/p/libfiu/
Add systemd files. Use example from Ceph-usage
Make the naming of everything consistent
main.rs::287 -- An OperationInfo struct is created with device_id set to 0. This then creates an operation entry that doesn't correspond to any device. It is essentially a zombie and will not be found when looking for outstanding operations (since get_outstanding_tickets() joins the operations table with the devices table).
Also change the struct to accept only valid input upon instantiation.
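One way to make the struct reject invalid input at construction time is sketched below. Only the names `OperationInfo` and `device_id` come from the issue. The single-field layout, the `new` constructor, the `InvalidOperation` error, and the use of `NonZeroU32` are assumptions for illustration; the real struct has other fields that are not shown here.

```rust
use std::num::NonZeroU32;

/// Hypothetical validation error for this sketch.
#[derive(Debug, PartialEq, Eq)]
pub enum InvalidOperation {
    ZeroDeviceId,
}

/// Simplified stand-in for bynar's OperationInfo (real fields omitted).
#[derive(Debug, PartialEq, Eq)]
pub struct OperationInfo {
    // Private, so the only way to build a value is through `new`.
    device_id: NonZeroU32,
}

impl OperationInfo {
    /// Rejects device_id == 0, which would not match any row in the devices table.
    pub fn new(device_id: u32) -> Result<Self, InvalidOperation> {
        NonZeroU32::new(device_id)
            .map(|device_id| OperationInfo { device_id })
            .ok_or(InvalidOperation::ZeroDeviceId)
    }

    pub fn device_id(&self) -> u32 {
        self.device_id.get()
    }
}
```

Because the field is private and typed as `NonZeroU32`, a zero id cannot be stored by accident once the value exists.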
Hi Chris! Could you please include a contributing guide so people know how to get involved?
Thanks so much.
High level steps for removing an entire server from the cluster.
add_or_update_operation() needs to be called to a) evaluate and set the completed/done time and b) track other operations, such as disk add and disk remove, for each piece of hardware that is remediated.
It would be great if it updated tickets to talk about what actions it's taken.
https://gluster.readthedocs.io/en/latest/Administrator%20Guide/Managing%20Volumes/#replace-brick
Have bynar take these steps to swap a disk for glusterfs. The volume type will need to be ascertained before following the appropriate set of steps.
In testing with a fresh Ubuntu 18.04 install I noticed that bynar tried to mark my root disk for replacement. It didn't bother to check whether the disk had mounted partitions. This needs fixing.
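A minimal sketch of such a check, assuming the caller passes in text in /proc/mounts format (for example the contents of /proc/mounts read with std::fs::read_to_string). Both function names are hypothetical. The matching only covers plain partition names such as /dev/sda1 or /dev/nvme0n1p2, so LVM or device-mapper volumes sitting on the disk would need separate handling.

```rust
/// True if `source` is `device` itself or one of its partitions,
/// e.g. /dev/sda1 for /dev/sda, or /dev/nvme0n1p2 for /dev/nvme0n1.
pub fn is_partition_of(source: &str, device: &str) -> bool {
    match source.strip_prefix(device) {
        Some(rest) => {
            if rest.is_empty() {
                return true; // the whole disk is mounted directly
            }
            // Allow the optional 'p' separator used by NVMe-style names.
            let digits = rest.strip_prefix('p').unwrap_or(rest);
            !digits.is_empty() && digits.chars().all(|c| c.is_ascii_digit())
        }
        None => false,
    }
}

/// Scans mount-table text (first column is the mount source) for any
/// mounted partition that belongs to `device`.
pub fn has_mounted_partitions(device: &str, mounts: &str) -> bool {
    mounts
        .lines()
        .filter_map(|line| line.split_whitespace().next())
        .any(|source| is_partition_of(source, device))
}
```

A disk for which `has_mounted_partitions` returns true could then be skipped before it is ever considered for replacement.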
It would be nice if there was a database column to show how long a scan took which could help identify machines where bynar is crashing or taking forever.
Track the latency of disks and kick slow ones out of the cluster. Disks that are slow but not dead can kill a cluster. Benchmark the disks and keep a log somewhere. If a disk produces bad results 5 times in a row, kick it out of the cluster.
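The "5 bad results in a row" rule could be tracked with a small counter like the one below. This is only a sketch: `LatencyTracker`, the millisecond unit, and the threshold values are assumptions, and how a benchmark is actually run or logged is left out.

```rust
/// Hypothetical per-disk tracker for consecutive slow benchmark results.
pub struct LatencyTracker {
    threshold_ms: u64,
    max_consecutive_bad: u32,
    consecutive_bad: u32,
}

impl LatencyTracker {
    pub fn new(threshold_ms: u64, max_consecutive_bad: u32) -> Self {
        LatencyTracker {
            threshold_ms,
            max_consecutive_bad,
            consecutive_bad: 0,
        }
    }

    /// Records one benchmark result and returns true once the disk has been
    /// slow `max_consecutive_bad` times in a row. A single good result
    /// resets the streak.
    pub fn record(&mut self, latency_ms: u64) -> bool {
        if latency_ms > self.threshold_ms {
            self.consecutive_bad += 1;
        } else {
            self.consecutive_bad = 0;
        }
        self.consecutive_bad >= self.max_consecutive_bad
    }
}
```

Resetting on any good result keeps a disk with occasional spikes in the cluster, while a consistently slow one is flagged after the fifth bad run.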
The provided platforms (Slack and JIRA) can be used if the owner meets their requirements. Would it be possible to develop adapters for notification services, so that users can either write their own or send a constructed .json payload to any endpoint?
This would prove valuable for the different types of gear that might lie underneath. Construct the errors and the process, and hand them off to the adapters or modules for the vendor. Specify those in the configuration, with a set of static configurations to be passed. Pointers or objects can be used, or simply the functions thereof.
The idea here is that, no matter the call or notification, we receive the same message each time, which can be passed to the necessary next stage. It should also help if you have a noisy source, or something that sends multiple messages. It is inherent in the definition, but the idea is: if we receive the same alert multiple times, don't create multiple tickets.
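The "same alert, one ticket" idea could be sketched as a lookup keyed on a normalized alert. Everything named here (`AlertKey`, its fields, `TicketDeduplicator`, and the locally generated ticket ids) is hypothetical. A real adapter would create the ticket in Slack, JIRA, or another backend rather than incrementing a counter.

```rust
use std::collections::HashMap;

/// Hypothetical normalized form of an incoming alert.
#[derive(Debug, Clone, PartialEq, Eq, Hash)]
pub struct AlertKey {
    pub host: String,
    pub device: String,
    pub kind: String,
}

/// Remembers which alerts already have an open ticket.
#[derive(Default)]
pub struct TicketDeduplicator {
    open: HashMap<AlertKey, u64>,
    next_id: u64,
}

impl TicketDeduplicator {
    pub fn new() -> Self {
        Self::default()
    }

    /// Returns (ticket_id, created). Repeated alerts with the same key
    /// reuse the existing ticket instead of opening another one.
    pub fn ticket_for(&mut self, key: AlertKey) -> (u64, bool) {
        if let Some(&id) = self.open.get(&key) {
            return (id, false);
        }
        self.next_id += 1;
        let id = self.next_id;
        self.open.insert(key, id);
        (id, true)
    }
}
```

The important design choice is the key: two messages count as the same alert only if every field of `AlertKey` matches, so the normalization step decides how aggressive the deduplication is.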
Write some protobuf APIs for HP ticket integration.
We often see an interface go down in a bonding configuration. It can simply be brought back up with ifdown/ifup about 60 to 70% of the time.
If it is still not up,
file a ticket to do the following.
Exposing the database over the api will allow others to integrate easier.
A Rust Redfish library doesn't exist yet, but having one would allow power cycling servers without ssh access. It would also allow BIOS remediation and a host of other things. The library could probably be generated from schema files: https://redfish.dmtf.org/index.php?q=redfish/schema_index
While it's great to have everything written as Rust code, it might also be nice for end customers to be able to create workflow pipelines without writing Rust. This would allow customers who don't have programming knowledge to automate their drive change process.
Linux has an issue where a failed disk often gets removed from the file tree. That makes it impossible to know whether the disk never existed or just failed, unless we previously recorded it somewhere. The goal of this task is to modify the check_all_disks fn in the test_disk.rs file and have it communicate with the database to check whether any disks are missing. If they are missing, can we assume they have failed and need intervention?
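The comparison itself could be as simple as a set difference between what the database previously recorded and what the host currently reports. `missing_disks` is a hypothetical helper. Fetching the recorded devices from the database and enumerating the present ones are assumed to happen elsewhere.

```rust
use std::collections::BTreeSet;

/// Returns the recorded device paths that are no longer present on the host,
/// sorted and without duplicates.
pub fn missing_disks(recorded: &[&str], present: &[&str]) -> Vec<String> {
    let present: BTreeSet<&str> = present.iter().copied().collect();
    let mut missing: Vec<String> = recorded
        .iter()
        .copied()
        .filter(|dev| !present.contains(dev))
        .map(str::to_string)
        .collect();
    missing.sort();
    missing.dedup();
    missing
}
```

Whether every entry in the result should be treated as failed is the open question raised above. The sketch only identifies the candidates.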
With enough threads it seems that LVM init segfaults for some unknown reason.
```
00:16:10 [DEBUG] postgres: preparing query with name ``: UPDATE hardware SET state = 'unscanned' WHERE device_id=2
00:16:10 [DEBUG] postgres: executing query: COMMIT
00:16:10 [DEBUG] bynar::test_disk: thread 2993564 Transition failed. Trying next transition
00:16:10 [DEBUG] lvm: dropping lvm
Thread 28 "bynar" received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7fffe33f9700 (LWP 2993637)]
0x00007ffff7683607 in dm_pool_alloc_aligned () from /lib/x86_64-linux-gnu/libdevmapper.so.1.02.1
(gdb) bt
#0 0x00007ffff7683607 in dm_pool_alloc_aligned () from /lib/x86_64-linux-gnu/libdevmapper.so.1.02.1
#1 0x00007ffff7683a5e in dm_pool_zalloc () from /lib/x86_64-linux-gnu/libdevmapper.so.1.02.1
#2 0x00007ffff78cef48 in ?? () from /lib/x86_64-linux-gnu/liblvm2app.so.2.2
#3 0x00007ffff78c3cae in ?? () from /lib/x86_64-linux-gnu/liblvm2app.so.2.2
#4 0x00007ffff78c755e in ?? () from /lib/x86_64-linux-gnu/liblvm2app.so.2.2
#5 0x00007ffff78b46a0 in lvm_init () from /lib/x86_64-linux-gnu/liblvm2app.so.2.2
#6 0x0000555556284acc in lvm::Lvm::new::h70e76aef47c8d063 (system_dir=Option<&str>) at lvm/src/lib.rs:450
#7 0x0000555555c2a5b0 in bynar::test_disk::is_disk_blank::h6e8f871b18876247 (dev=0x7fff74001320) at src/test_disk.rs:1586
#8 0x0000555555c19802 in _$LT$bynar..test_disk..Eval$u20$as$u20$bynar..test_disk..Transition$GT$::transition::h985d5930d2e53f8b (
to_state=Good, device=0x7fffe33f2d48, _scsi_info=0x7fffe33f2ed8, _simulate=false) at src/test_disk.rs:449
```