
aicoe-aiops / ceph_drive_failure


An AI/ML solution that provides a probability that a hard drive will fail within some pre-defined time period.

License: GNU Lesser General Public License v2.1

Languages: Jupyter Notebook 10.65%, HTML 89.33%, Python 0.02%

Contributors

chauhankaranraj, dependabot[bot], goern, harshad16, isabelizimm, michaelclifford, sankbad, sesheta

ceph_drive_failure's Issues

Forecast Using Aggregated Features instead of Raw Time Series

Feedback no. 5

In the current forecasting notebook, we assumed that the maximum number of days of data that we are guaranteed to have at runtime is 6. However, after talking to Ceph subject matter experts, it seems there may be some flexibility there.

It may be possible to have aggregated values describing time series behavior over a longer period of time, instead of having raw data. For example, consider SMART 5 values for device A. Instead of storing a vector

[100, 100, 100, 99, 95, 96]

representing the raw values from the last 6 days, we could instead store a vector

[(99.5, 0.24), (100, 0), (100, 0), (99.5, 0.2), (99.25, 0.1), (98.33, 0.56)]

where the first tuple is the (mean, std) of SMART 5 over the most recent 6 days, the next tuple covers the 6 days before that (days 6-12), and so on. This way we can describe the last 36 days of behavior using just 12 values.
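The windowed aggregation above could be sketched as follows (a minimal sketch, assuming non-overlapping 6-day windows and population standard deviation; the tuple values in the issue's example are illustrative, not outputs of this code):

```python
import numpy as np

def aggregate_windows(values, window=6):
    """Summarize a raw SMART time series as (mean, std) tuples per window.

    `values` is ordered oldest -> newest; each consecutive `window`-day
    slice is collapsed into two numbers, so 36 days of raw data become
    6 tuples (12 values) instead of 36 raw readings.
    """
    values = np.asarray(values, dtype=float)
    n_windows = len(values) // window
    feats = []
    for i in range(n_windows):
        chunk = values[i * window:(i + 1) * window]
        # Cast to plain floats so the result prints cleanly
        feats.append((float(round(chunk.mean(), 2)), float(round(chunk.std(), 2))))
    return feats

# Last 6 days of SMART 5 for a hypothetical device:
raw = [100, 100, 100, 99, 95, 96]
print(aggregate_windows(raw))  # → [(98.33, 2.05)]
```

With 36 days of raw values, the same call would return the 6-tuple vector described above.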

As a data scientist, I want to explore whether it is possible to have a forecasting model predict future values using such aggregated features as input, instead of raw values.

Acceptance criteria:

  • EDA notebook exploring possible models with the above setup
  • Compare performance of models created with the above setup vs current setup
  • Compare performance of models created using different types of aggregated features - e.g. mean, std, min, max, entropy.

[Epic] Explore FAST Dataset

In this project we have primarily been working with the backblaze dataset, and more recently the ceph-telemetry dataset. These datasets mainly consist of SMART metrics collected from hard disks via the smartctl tool (although ceph-telemetry also contains quite a lot of metadata, in addition to SMART metrics).

However, some recent research suggests that incorporating disk performance and disk location data on top of SMART data can be valuable in analyzing disk health. Specifically, this paper claims to achieve improvements in disk failure prediction models, when using these additional metrics. If this is indeed true for our use cases as well, then ceph should also collect these metrics from their users as a part of ceph-telemetry, so that we can build better models.

In this epic, we will explore this FAST dataset and evaluate the tradeoffs between performance gain and overhead of collecting these metrics from users. This would help us determine the optimal set of additional features that ceph should collect from users, to get the maximum benefit (in terms of better disk health prediction models).

SMART Metrics Forecasting

As per the user story laid out by ceph team here, their device health prediction module should have two models - one to forecast future SMART metrics for a device, and one to predict if the device will fail.

We already created the latter type of model some time back and added it to upstream ceph (ceph/ceph#29437). Another PR (yaarith/ceph-telemetry#1) integrates this model to a public facing grafana dashboard, so that end users and SMEs can directly see model predictions and provide feedback.

However, we don't have any models for the SMART metrics forecasting yet.

Acceptance criteria:

  • EDA notebook to explore and compare forecasting models

Note: copying this issue from internal repo (aicoe-aiops/ceph-data-drive-failure#6) since issues can't be transferred from private to public repos

[EPIC] - Open Source Ceph Dataset

As a data scientist working on disk health prediction for ceph,
I want access to disk health data from actual users’ ceph clusters,
So that I can train models using this data, which would likely result in more accurate and precise models than those created with external data.

As a Ceph product manager,
I want to anonymize and publish ceph telemetry data,
So that I can build an open source data science community around this data.

Totally open to ideas on how to approach this. One idea for an MVP: manually extract and store one quarter of data (in a similar fashion to Backblaze) in an s3 bucket on the operate-first cluster, along with a small jupyter notebook that shows how to access this data and does some light EDA on it. Then, if this seems like a suitable way to share data, we can automate the data publishing process.
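The MVP upload step might look like the sketch below. The `ceph-telemetry` prefix, bucket layout, and `device_health_metrics.csv` filename are all hypothetical placeholders, not the actual operate-first configuration:

```python
import pathlib

def object_key(quarter: str, filename: str, prefix: str = "ceph-telemetry") -> str:
    """Build the S3 object key for one file of a quarterly data drop.

    Mirrors Backblaze's quarterly-folder convention: <prefix>/<quarter>/<file>.
    """
    return f"{prefix}/{quarter}/{filename}"

def upload_quarter(local_dir: str, bucket: str, quarter: str) -> list:
    """Upload every CSV in `local_dir` to s3://<bucket>/<prefix>/<quarter>/."""
    import boto3  # assumes S3-compatible credentials for the target cluster
    s3 = boto3.client("s3")
    keys = []
    for path in sorted(pathlib.Path(local_dir).glob("*.csv")):
        key = object_key(quarter, path.name)
        s3.upload_file(str(path), bucket, key)
        keys.append(key)
    return keys

print(object_key("2021-Q1", "device_health_metrics.csv"))
```

The EDA notebook would then read the same keys back with the matching `get_object` calls.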

Acceptance Criteria

  • public s3 bucket on operate-first - #32
  • add one quarter's worth of data to this bucket
  • data access + EDA notebook - #33
  • publish on operate-first website

Add data directories to repo

As a data scientist, I want to make sure all projects follow the same data saving conventions. Currently this repo does not follow the data directory structure laid out in the project-template repo. This needs to be fixed.

Acceptance criteria

  • directories with .gitkeep added to repo
  • data saving/reading code in all relevant notebooks updated to use these directories

Ceph telemetry: EDA notebook

As a data scientist,
I want to access and explore Ceph telemetry data,
So that I can use it for creating more accurate ML models for failure prediction and forecasting.

Once the initial dataset is available via #32, we can create a notebook to show how an external user can access it, and perform some initial EDA. This notebook can then be published to a blog (e.g. operate-first).

Acceptance criteria

  • Initial EDA notebook
  • Notebook published to operate first

Vendor Agnostic Models

Feedback no. 3

One salient feature of the Backblaze dataset is that the distribution of vendors in the data is neither uniform nor exhaustive. For example, Seagate comprises ~70% of the data, HGST comprises ~15%, Intel drive data is absent, etc. Also, our initial assumption was that SMART metrics may behave differently for different vendors. Therefore, in the current forecasting notebook, models are trained vendor-wise. However, the distribution of vendors across Ceph users is likely different, and we want to support all of those vendors.

As a data scientist, I want to explore how "transferable" forecasting models are, across vendors. That is, how is performance affected when a model is trained on data from one vendor and evaluated on data from another one.
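The cross-vendor evaluation grid could be sketched as below. The `fit` and `score` stand-ins here are toy functions (training-set mean, absolute error against the test-set mean), not the real forecasting pipeline:

```python
def cross_vendor_scores(data_by_vendor, fit, score):
    """Train one model per vendor, score every (train_vendor, test_vendor) pair.

    data_by_vendor: vendor -> (train_set, test_set).
    Returns {(src, dst): score of src-trained model on dst's test set}.
    """
    models = {v: fit(train) for v, (train, _) in data_by_vendor.items()}
    return {(src, dst): score(models[src], data_by_vendor[dst][1])
            for src in data_by_vendor for dst in data_by_vendor}

# Toy stand-ins: the "model" is just the training mean, and the score is
# the absolute error of that mean against the test mean.
fit = lambda train: sum(train) / len(train)
score = lambda m, test: round(abs(m - sum(test) / len(test)), 2)

data = {"seagate": ([100, 98, 96], [97, 95, 96]),
        "hgst": ([100, 100, 99], [100, 99, 99])}
print(cross_vendor_scores(data, fit, score))
```

The off-diagonal entries of this grid are exactly the "transferability" question: how much worse (or better) a model does on a vendor it never saw in training.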

Acceptance criteria:

  • EDA notebook comparing model performance on data from the vendor it's trained on and data from other vendors

Design Prediction Module Interface

As a data scientist,
I want to establish the API for the disk health prediction python module to be created by #27,
So that I can design the python module according to what kind of inputs to expect and what kind of output to return.

As a ceph developer,
I want to establish the interface between the ceph manager daemon and the disk health prediction python module (to be created by #27),
So that I can ensure ceph manager is able to provide the inputs needed by this module in the correct format, and is able to use the output provided by it.
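One possible shape for that interface is sketched below. The class and method names (`DevicePredictor`, `predict_life_expectancy`) and the label set are illustrative assumptions for the doc, not the final API:

```python
from typing import Dict, List

class DevicePredictor:
    """Sketch of the contract between ceph-mgr and the prediction module.

    The daemon passes raw per-day SMART attribute dicts; the module
    returns a coarse health label. A real implementation would run the
    trained classifier; this stub only validates the contract.
    """

    HEALTH_LABELS = ("good", "warning", "bad")

    def predict_life_expectancy(self, smart_records: List[Dict]) -> str:
        """smart_records: one dict of SMART attributes per day, newest last."""
        if not smart_records:
            raise ValueError("need at least one day of SMART data")
        return "good"  # placeholder decision, always optimistic

predictor = DevicePredictor()
print(predictor.predict_life_expectancy([{"smart_5_raw": 0, "smart_187_raw": 0}]))
```

Pinning down the input dict schema and the output label vocabulary up front is what lets the ceph-mgr side and the module side be developed independently.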

Acceptance Criteria

  • doc outlining the interface added to docs directory

Repeat Forecasting Pipeline for Other Metrics

Feedback no. 1

In the current exploratory notebook for forecasting, the metric “reallocated sector count” (SMART 5) is being forecast, and models are evaluated based on their predictive power on this metric. However, there are some other metrics that are also indicative of drive failure. For Seagate drives these SMART metrics are: 187, 188, 197, and 198.

As a data scientist, I want to train and evaluate the models used in the current notebook on these other metrics, to get a better understanding of model performance across different metrics.
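The rerun loop could be sketched as below, with a naive persistence forecaster standing in for the real models from the forecasting notebook (the column names follow the usual `smart_N_raw` convention but are assumptions here):

```python
import numpy as np

# Failure-indicative metrics for Seagate drives, per the issue above.
METRICS = ["smart_5_raw", "smart_187_raw", "smart_188_raw",
           "smart_197_raw", "smart_198_raw"]

def persistence_mae(series):
    """MAE of forecasting each value as the previous day's value."""
    s = np.asarray(series, dtype=float)
    return float(np.mean(np.abs(s[1:] - s[:-1])))

def evaluate_all(device_history):
    """device_history maps metric name -> list of daily raw values.

    Returns a per-metric score so results can be compared side by side.
    """
    return {m: round(persistence_mae(device_history[m]), 3)
            for m in METRICS if m in device_history}

history = {"smart_5_raw": [0, 0, 0, 8, 8, 16],
           "smart_197_raw": [0, 0, 1, 1, 2, 2]}
print(evaluate_all(history))  # → {'smart_5_raw': 3.2, 'smart_197_raw': 0.4}
```

Swapping `persistence_mae` for each candidate model from the notebook gives the metric-by-metric comparison the acceptance criteria ask for.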

Acceptance criteria

  • EDA notebook to rerun experiments in forecasting notebook with other metrics (listed above), and compare results

ADR for failure prediction model deployment

As a developer/maintainer of machine learning models in ceph, I want to understand the python package related issues and constraints in integrating these models in ceph.

Acceptance Criterion

  • ADR describing the current scikit-learn issue in detail, and what potential solutions could look like

Fix linting and formatting errors

As a data scientist, I want to ensure all the code in this repo is well formatted and follows best software quality practices. Since this repo was created way before we had aicoe-aiops/project-template set up, many of the notebooks and scripts have gotten merged into this repo without being properly linted or formatted. This needs to be fixed.

Acceptance Criterion

  • update notebooks and scripts so that all pre-commit tests are passing

Pull Embedded ML Code Out of Ceph Upstream and Create Standalone Module

As a data scientist,
I want to create a public python module to run inference using the models trained in this project,
So that these disk health prediction models are accessible and usable by a wider audience than just ceph users.

As a ceph product manager,
I want to use a separate, standalone python module for disk health prediction,
So that data preprocessing, feature selection, model selection, and other steps of the ML pipeline are decoupled from the ceph codebase. The ceph manager daemon should be able to just pass raw data to this module, and get health predictions in return.

Acceptance Criteria

  • standalone python module created
  • delivery pipeline is up and running on git tag creation @harshad16
  • python module delivered to pypi.org/AICoE/ @harshad16

FAST dataset EDA

As a data scientist,
I want to perform an initial EDA on the FAST dataset,
So that I understand the salient features of this dataset, and see how it compares to backblaze and ceph-telemetry datasets that I have worked with before. This should give me some idea for how to go about performing classification and regression tasks with this dataset.

This story is a part of epic #43

Acceptance Criteria

  • notebook showing how to access data, and performing initial EDA

Forecasting using FAST dataset

As a data scientist,
I want to run the previous forecasting experiments on the new FAST dataset,
So that I can

  1. evaluate the performance of existing methods on a new dataset
  2. quantify the marginal benefit (or any drawbacks) of having additional features available from the FAST dataset

This story is a part of epic #43

Acceptance Criteria

  • notebook training multivariate forecasting models on the FAST dataset, and comparing them to models trained on the backblaze dataset

Explore Different Lengths of Data Available

Feedback no. 4

In the current forecasting notebook, we assumed that the maximum number of days of data that we are guaranteed to have at runtime is 6. However, after talking to Ceph subject matter experts, it seems there may be some flexibility there.

On the one hand, having more data available might improve model accuracy. On the other hand, it would mean users have to store more health data locally. The main purpose of this issue is to find the “sweet spot” where model performance improves without requiring a lot of stored data.

As a data scientist, I want to explore how model performance changes with the number of days of data available at runtime, to find a reasonable compromise between the amount of data stored and the model accuracy achieved.
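The sweep over window lengths could be sketched like this, with a toy mean predictor standing in for the real forecasting models (a real experiment would retrain the actual models per window length):

```python
import numpy as np

def score_with_window(series, n_days):
    """Predict the next value as the mean of the last `n_days` values
    and return the absolute error of that prediction."""
    series = np.asarray(series, dtype=float)
    history, target = series[-(n_days + 1):-1], series[-1]
    return abs(float(history.mean()) - target)

# Illustrative sweep over candidate window lengths on one toy series:
smart_5 = [100, 100, 100, 99, 99, 98, 98, 97, 95, 96, 95, 94, 93]
for n_days in (3, 6, 12):
    print(n_days, round(score_with_window(smart_5, n_days), 2))
```

Plotting error against `n_days` across many devices is what would reveal the sweet spot the issue describes.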

Acceptance criteria:

  • EDA notebook showing effect of number of days of data on model accuracy

Multivariate Time Series Forecasting

Feedback no. 2

The current forecasting notebook has a univariate time series forecasting setup. That is, only one SMART metric is forecast at a time, independent of the other metrics. However, it is likely that there are interactions / dependencies across SMART metrics.

As a data scientist, I want to explore training multivariate time series forecasting models, which can capture interactions between various SMART metrics and possibly improve forecasting power.
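A minimal multivariate sketch, assuming a VAR(1)-style model fit with plain least squares (the real notebook would presumably use a proper forecasting library): the coefficient matrix couples the metrics, so each metric's forecast can draw on all the others.

```python
import numpy as np

def fit_var1(X):
    """Fit x_t ≈ x_{t-1} @ A jointly across all metrics.

    X: array of shape (n_days, n_metrics), rows ordered oldest -> newest.
    Returns the (n_metrics, n_metrics) coefficient matrix A.
    """
    past, future = X[:-1], X[1:]
    A, *_ = np.linalg.lstsq(past, future, rcond=None)
    return A

def forecast_next(X, A):
    """One-step-ahead forecast for all metrics from the latest row."""
    return X[-1] @ A

# Two toy SMART-like metrics over 7 days: one decaying, one counting up.
X = np.array([[100., 0.], [100., 1.], [99., 1.], [99., 2.],
              [98., 2.], [98., 3.], [97., 3.]])
A = fit_var1(X)
print(np.round(forecast_next(X, A), 1))
```

Comparing this against the per-metric univariate models would show whether the cross-metric terms in A actually add forecasting power.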

Acceptance criteria

  • EDA notebook to train multivariate models and compare performance with univariate models

Ceph telemetry: setup s3 bucket on MOC and add initial dataset

As a Ceph product manager,
I want a publicly accessible long term storage space,
So that I can regularly upload Ceph telemetry data there and external users can access it easily.

As per the discussion in operate-first/support#198, we requested an account on MOC. We now need to figure out how to set up an s3 bucket associated with this account. Since we have not worked with OpenStack before, we will need to open a ticket and start a thread with MOC folks on how to do this.

Acceptance Criteria

  • S3 bucket created
  • Initial dataset uploaded
