bynar's People

Contributors

cholcombe973, comcastrp, johnriv, kristakhare, mbhask000, mzhong1, sdandam, shillasaebi, vinutha17

bynar's Issues

Improve Error Handling

The error handling in bynar could be improved. A lot of the errors are converted to String, which sometimes loses information about what the actual error is. There are a few options that could improve the situation:

  1. https://rust-lang-nursery.github.io/error-chain/error_chain/index.html
  2. https://boats.gitlab.io/failure/fail.html (which I've heard might be better)
  3. https://github.com/rushmorem/derive-error (which looks to be about what I'd do manually). This could be the easiest option if it works.
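For comparison, here is a minimal std-only sketch of the manual approach that option 3 would roughly automate. The `BynarError` name, its variants, and `parse_device_id` are hypothetical and not taken from the bynar code base; the point is only that `?` can convert the underlying error without flattening it to a String.

```rust
use std::fmt;
use std::io;
use std::num::ParseIntError;

/// Hypothetical error type; the variants are illustrative, not bynar's.
#[derive(Debug)]
pub enum BynarError {
    Io(io::Error),
    Parse(ParseIntError),
    /// Fallback for errors that really are just a message.
    Message(String),
}

impl fmt::Display for BynarError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            BynarError::Io(e) => write!(f, "I/O error: {}", e),
            BynarError::Parse(e) => write!(f, "parse error: {}", e),
            BynarError::Message(m) => write!(f, "{}", m),
        }
    }
}

impl std::error::Error for BynarError {
    /// Keeps the original error reachable instead of discarding it.
    fn source(&self) -> Option<&(dyn std::error::Error + 'static)> {
        match self {
            BynarError::Io(e) => Some(e),
            BynarError::Parse(e) => Some(e),
            BynarError::Message(_) => None,
        }
    }
}

impl From<io::Error> for BynarError {
    fn from(e: io::Error) -> Self {
        BynarError::Io(e)
    }
}

impl From<ParseIntError> for BynarError {
    fn from(e: ParseIntError) -> Self {
        BynarError::Parse(e)
    }
}

/// Hypothetical helper: `?` uses the `From` impl above to wrap the error.
pub fn parse_device_id(raw: &str) -> Result<u32, BynarError> {
    Ok(raw.trim().parse::<u32>()?)
}
```

A caller can then match on the variant, or walk `source()`, to see what actually failed.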

hdd library

I was checking this out today: https://docs.rs/hdd/0.10.0/hdd/ and it might actually be better at identifying disks than udev. I've seen Linux get into situations where udev drops the device from the tree but it's still mounted in the filesystem tree. I wonder if this would help with that problem.

What if a system panics, dies, or partially dies?

Since Bynar runs as a binary (agent) on the system, what happens in the following scenarios?

  • Kernel panic
  • System rebooted and did not come back up
  • Someone stopped the agent and did not restart it
  • Partial failure due to hardware (memory, CPU, RAID...)

When the system goes down, the agent goes down with it, because the agent runs on the very system that must be healthy for the monitoring to execute.

Possible solutions:

  • Client/server architecture?
  • Peer-to-peer monitoring (e.g. Ceph OSDs)?

Possible issues with these solutions:

  • Client/server architecture brings administrative overhead: failover, firewalls, DR, certs, LB, redundancy...
  • Peer-to-peer: message broadcasting, or a streamlined, narrowed-down approach? For example, should a failed system be monitored only by its neighbors, i.e. the systems before and after it in the sequence?

Just throwing out my thoughts so they don't get lost. :)

Virtual Machine Test Env

Testing bynar is challenging, to say the least. I wonder whether a virtual machine could be created that allows integration tests to be run over and over to validate logic correctness.

Slowly add new disks

Some Ceph clusters can't handle having disks added quickly. This task would have them added in slowly.

make Bynar peer to peer

Bynar is currently standalone. Each system runs Bynar as a service and communicates with a database to log its operations. An enhancement would be to make it a peer-to-peer application. This would make it possible to heal systems that are otherwise unreachable due to a network card failure, or to reduce the workload on a busy system. Every system where Bynar is installed should be able to communicate with the others in the peer network. The communication could potentially be limited to systems within a particular network zone or within a storage cluster. Alternatively, it could be limited to systems that use the same underlying storage technology. There is a Rust library that can be explored to accomplish this.

Maintenance Mode

Anything better we can come up with besides just disabling the cron job while maintenance is being performed on a server?

Different permission levels

It would be nice to restrict what commands can be run on the CLI. This is partly taken care of by Linux permissions. Maybe that's good enough, not sure yet.

Document code

The code needs a lot more code comments to help newcomers.

Bug: OperationInfo::new() created with device_id set to 0

main.rs::287 -- An OperationInfo struct is created with device_id set to 0. This then creates an operation entry that doesn't correspond to any device; it is essentially a zombie and will not be found when looking for outstanding operations (since get_outstanding_tickets() joins the operations table with the devices table).
Also, change the struct to accept only valid input upon instantiation.
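One way to reject invalid input at construction, sketched under assumptions: the struct below is a hypothetical stand-in (the real `OperationInfo` fields are not shown in this issue), and it assumes 0 is never a valid device_id.

```rust
use std::num::NonZeroU32;

/// Hypothetical stand-in for bynar's `OperationInfo`; the real fields differ.
#[derive(Debug)]
pub struct OperationInfo {
    device_id: NonZeroU32,
    description: String,
}

impl OperationInfo {
    /// Returns `None` when `device_id` is 0, so a zombie entry cannot be built.
    pub fn new(device_id: u32, description: &str) -> Option<Self> {
        Some(OperationInfo {
            device_id: NonZeroU32::new(device_id)?,
            description: description.to_string(),
        })
    }

    pub fn device_id(&self) -> u32 {
        self.device_id.get()
    }

    pub fn description(&self) -> &str {
        &self.description
    }
}
```

With this shape the caller at main.rs::287 would have to look up a real device_id before it could create the operation at all.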

Needs a contributing guide

Hi Chris! Could you please include a contributing guide so people know how to get involved?
Thanks so much.

fine tune tracking of individual operations

add_or_update_operation() needs to be called to a) evaluate and set the completed/done time and b) track other operations such as disk add, disk remove, etc. for each piece of hardware that is remediated.

Update jira tickets

It would be great if Bynar updated tickets to describe the actions it has taken.

Bynar doesn't understand root disks

In testing with a fresh Ubuntu 18.04 install, I noticed that bynar tried to mark my root disk for replacement. It didn't check whether the disk had mounted partitions. This needs fixing.
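A rough sketch of such a check, assuming the mount table is read from /proc/mounts and passed in as text. The function names are hypothetical, and the matching is purely name-based: it will not see LVM or device-mapper volumes, or mounts that use /dev/disk/by-uuid style paths.

```rust
/// True if `source` is `disk` itself or one of its partitions, by name only.
fn is_same_disk_or_partition(source: &str, disk: &str) -> bool {
    let rest = match source.strip_prefix(disk) {
        Some(rest) => rest,
        None => return false,
    };
    if rest.is_empty() {
        return true; // the whole disk is mounted directly
    }
    // Disks whose name ends in a digit (e.g. nvme0n1) use a `p` separator.
    let disk_ends_in_digit = disk.ends_with(|c: char| c.is_ascii_digit());
    let number = if disk_ends_in_digit {
        match rest.strip_prefix('p') {
            Some(n) => n,
            None => return false,
        }
    } else {
        rest
    };
    !number.is_empty() && number.chars().all(|c| c.is_ascii_digit())
}

/// `mounts` is the content of /proc/mounts; the first field of each line
/// is the mounted source device.
pub fn has_mounted_partitions(mounts: &str, disk: &str) -> bool {
    mounts
        .lines()
        .filter_map(|line| line.split_whitespace().next())
        .any(|source| is_same_disk_or_partition(source, disk))
}
```

A caller could read the table with `std::fs::read_to_string("/proc/mounts")` and skip any disk for which this returns true.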

Kick slow disks

Track the latency of disks and kick slow ones out of the cluster. Disks that are slow but not dead can kill a cluster. Benchmark the disks and keep a log somewhere. If a disk returns bad results 5 times in a row, kick it out of the cluster.
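The "5 times in a row" rule could be tracked with a small per-disk strike counter. This is only a sketch: `SlowDiskTracker` and the latency threshold are hypothetical, and how the benchmark itself is run is left out.

```rust
use std::collections::HashMap;
use std::time::Duration;

/// Hypothetical tracker: counts consecutive slow benchmark results per disk.
pub struct SlowDiskTracker {
    threshold: Duration,
    max_strikes: u32,
    strikes: HashMap<String, u32>,
}

impl SlowDiskTracker {
    pub fn new(threshold: Duration, max_strikes: u32) -> Self {
        SlowDiskTracker {
            threshold,
            max_strikes,
            strikes: HashMap::new(),
        }
    }

    /// Records one benchmark result and returns true once the disk has been
    /// slow `max_strikes` times in a row.
    pub fn record(&mut self, disk: &str, latency: Duration) -> bool {
        let strikes = self.strikes.entry(disk.to_string()).or_insert(0);
        if latency > self.threshold {
            *strikes += 1;
        } else {
            *strikes = 0; // a good result breaks the streak
        }
        *strikes >= self.max_strikes
    }
}
```

Resetting on a good result is a design choice; a sliding window (say, 5 bad results out of the last 8) would be more tolerant of a single lucky run.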

Idempotence for notifications

The provided platforms (Slack and JIRA) can be used if the owner meets their requirements. Would it be possible to develop adapters for notification services that let users either develop their own, or send constructed JSON to any endpoint?

This would prove valuable for the different types of gear that might lie underneath. Construct the errors and process them, then hand them off to the adapters or modules for the vendor. Specify those in the configuration, along with a set of static configurations to be passed. Pointers, objects, or simply the functions thereof can be used.

The idea here is that no matter the call or notification, we receive the same message each time, which can be passed to the necessary next stage. It should also help if you have a noisy source that sends multiple messages. It is inherent in the definition, but the idea is: if we receive the alert multiple times, don't create multiple tickets.
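A sketch of how the adapter and deduplication ideas might fit together. Everything here is hypothetical (the `Alert` fields, the `Notifier` trait, the in-memory `HashSet`); a real version would presumably persist the seen set so that duplicates are also caught across restarts.

```rust
use std::collections::HashSet;

/// Hypothetical alert; two alerts with equal fields count as the same alert.
#[derive(Debug, Clone, PartialEq, Eq, Hash)]
pub struct Alert {
    pub host: String,
    pub device: String,
    pub kind: String,
}

/// Adapter interface: Slack, JIRA, or a generic JSON endpoint would each
/// implement this.
pub trait Notifier {
    fn notify(&mut self, alert: &Alert);
}

/// Wraps any adapter and forwards each distinct alert only once.
pub struct Deduplicator<N: Notifier> {
    seen: HashSet<Alert>,
    inner: N,
}

impl<N: Notifier> Deduplicator<N> {
    pub fn new(inner: N) -> Self {
        Deduplicator {
            seen: HashSet::new(),
            inner,
        }
    }

    /// Returns true if the alert was forwarded, false if it was a repeat.
    pub fn handle(&mut self, alert: Alert) -> bool {
        if self.seen.contains(&alert) {
            return false;
        }
        self.inner.notify(&alert);
        self.seen.insert(alert);
        true
    }

    pub fn inner(&self) -> &N {
        &self.inner
    }
}

/// Stand-in adapter that only counts what it was asked to send.
pub struct CountingNotifier {
    pub sent: usize,
}

impl Notifier for CountingNotifier {
    fn notify(&mut self, _alert: &Alert) {
        self.sent += 1;
    }
}
```

Because the deduplication sits in front of the adapter, every backend gets the same "one ticket per alert" behavior without implementing it itself.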

Auto remediate bonding issue

We often see an interface go down in a bonding configuration; 60 to 70% of the time it can simply be brought back up with ifdown/ifup.

  1. Validate the OS
  2. Check bonding: cat /proc/net/bonding/bond1 (active/passive, based on the type of bonding configuration)
  3. ifdown
  4. ifup
  5. Validate

If it is still not up, file a ticket to do the following:

  1. Check the link on the switch port; if it is not green, engage the network team. If that doesn't fix it,
  2. clean or reseat the SFP; if that doesn't fix it,
  3. replace the SFP; if that doesn't fix it,
  4. replace the fiber; if that doesn't fix it,
  5. raise a vendor case.
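The bonding check could be sketched as a parser over the text of /proc/net/bonding/bond1. The `down_slaves` name is hypothetical, and the sketch assumes the usual layout where each "Slave Interface:" line is followed by that interface's own "MII Status:" line. Running ifdown/ifup and filing the ticket are left out.

```rust
/// Returns the slave interfaces whose MII status is not "up".
/// `bonding` is the content of a /proc/net/bonding/<bond> file.
pub fn down_slaves(bonding: &str) -> Vec<String> {
    let mut current: Option<&str> = None;
    let mut down = Vec::new();
    for line in bonding.lines() {
        let line = line.trim();
        if let Some(name) = line.strip_prefix("Slave Interface:") {
            current = Some(name.trim());
        } else if let Some(status) = line.strip_prefix("MII Status:") {
            // The bond-level "MII Status" comes before any slave section,
            // so it is skipped while `current` is still None.
            if let Some(name) = current {
                if status.trim() != "up" {
                    down.push(name.to_string());
                }
            }
        }
    }
    down
}
```

Each returned interface is a candidate for the ifdown/ifup step, after which the file can be parsed again to validate.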

Flexible workflow specification

While it's great to have everything written as Rust code, it might also be nice for end customers to be able to create workflow pipelines without writing Rust. This would allow customers who don't have programming knowledge to automate their drive change process.

Inspect the database for missing disks

Linux has an issue where a failed disk often gets removed from the file tree. That makes it impossible to know whether the disk never existed or just failed, unless we previously recorded it somewhere. The goal of this task is to modify the check_all_disks fn in the test_disk.rs file and have it communicate with the database to check whether there are any missing disks. If they are missing, can we assume they have failed and need intervention?
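The comparison itself is a set difference. A minimal sketch, assuming the recorded device paths come from the database and the present ones from the current scan; `missing_disks` is a hypothetical helper, not part of check_all_disks.

```rust
use std::collections::BTreeSet;

/// Devices that were recorded earlier but are no longer visible on the host.
pub fn missing_disks(recorded: &[&str], present: &[&str]) -> Vec<String> {
    let present: BTreeSet<&str> = present.iter().copied().collect();
    recorded
        .iter()
        .copied()
        .filter(|dev| !present.contains(dev))
        .map(String::from)
        .collect()
}
```

Whether a missing disk should be treated as failed is still the open question above; this only produces the list of candidates.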

LVM segfault issue

With enough threads it seems that LVM init segfaults for some unknown reason.
```
00:16:10 [DEBUG] postgres: preparing query with name ``: UPDATE hardware SET state = 'unscanned' WHERE device_id=2
00:16:10 [DEBUG] postgres: executing query: COMMIT
00:16:10 [DEBUG] bynar::test_disk: thread 2993564 Transition failed. Trying next transition
00:16:10 [DEBUG] lvm: dropping lvm

Thread 28 "bynar" received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7fffe33f9700 (LWP 2993637)]
0x00007ffff7683607 in dm_pool_alloc_aligned () from /lib/x86_64-linux-gnu/libdevmapper.so.1.02.1
(gdb) bt
#0 0x00007ffff7683607 in dm_pool_alloc_aligned () from /lib/x86_64-linux-gnu/libdevmapper.so.1.02.1
#1 0x00007ffff7683a5e in dm_pool_zalloc () from /lib/x86_64-linux-gnu/libdevmapper.so.1.02.1
#2 0x00007ffff78cef48 in ?? () from /lib/x86_64-linux-gnu/liblvm2app.so.2.2
#3 0x00007ffff78c3cae in ?? () from /lib/x86_64-linux-gnu/liblvm2app.so.2.2
#4 0x00007ffff78c755e in ?? () from /lib/x86_64-linux-gnu/liblvm2app.so.2.2
#5 0x00007ffff78b46a0 in lvm_init () from /lib/x86_64-linux-gnu/liblvm2app.so.2.2
#6 0x0000555556284acc in lvm::Lvm::new::h70e76aef47c8d063 (system_dir=Option<&str>) at lvm/src/lib.rs:450
#7 0x0000555555c2a5b0 in bynar::test_disk::is_disk_blank::h6e8f871b18876247 (dev=0x7fff74001320) at src/test_disk.rs:1586
#8 0x0000555555c19802 in _$LT$bynar..test_disk..Eval$u20$as$u20$bynar..test_disk..Transition$GT$::transition::h985d5930d2e53f8b (
to_state=Good, device=0x7fffe33f2d48, _scsi_info=0x7fffe33f2ed8, _simulate=false) at src/test_disk.rs:449
```
