aicoe-aiops / ceph_drive_failure
An AI/ML solution that provides a probability that a hard drive will fail within some pre-defined time period.
License: GNU Lesser General Public License v2.1
Feedback no. 5
In the current forecasting notebook, we assumed that the maximum number of days of data we are guaranteed to have at runtime is 6. However, after talking to ceph subject matter experts, it seems there might be some flexibility there.
It may be possible to have aggregated values describing time series behavior over a longer period of time, instead of having raw data. For example, consider SMART 5 values for device A. Instead of storing a vector
[100, 100, 100, 99, 95, 96]
representing the raw values from the last 6 days, we could instead store a vector
[(99.5, 0.24), (100, 0), (100, 0), (99.5, 0.2), (99.25, 0.1), (98.33, 0.56)]
where the first tuple is the (mean, std) of SMART 5 over the last 6 days, the next tuple is the (mean, std) over days 6-12 before that, and so on. This way we can describe the last 36 days of behavior using 12 discrete values.
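The windowed aggregation described above can be sketched as follows (a minimal illustration; the exact window size and feature layout are precisely what this story should explore):

```python
import numpy as np

def aggregate_windows(values, window=6):
    """Summarize a raw SMART time series as per-window (mean, std) tuples.

    `values` is ordered oldest to newest; each consecutive block of
    `window` days is reduced to its mean and standard deviation.
    """
    values = np.asarray(values, dtype=float)
    n_windows = len(values) // window
    feats = []
    for i in range(n_windows):
        chunk = values[i * window : (i + 1) * window]
        feats.append((chunk.mean(), chunk.std()))
    return feats

# 36 days of raw SMART 5 readings -> 6 (mean, std) pairs, i.e. 12 values
raw = [100] * 30 + [100, 100, 100, 99, 95, 96]
print(aggregate_windows(raw))
```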
As a data scientist, I want to explore if it is possible to have a forecasting model predicting future values using such aggregated features as input, instead of raw values
Acceptance criteria:
In this project we have primarily been working with the backblaze dataset, and more recently the ceph-telemetry dataset. These datasets mainly consist of SMART metrics collected from hard disks via the smartctl tool (although ceph-telemetry also contains quite a lot of metadata in addition to SMART metrics).
However, some recent research suggests that incorporating disk performance and disk location data on top of SMART data can be valuable in analyzing disk health. Specifically, this paper claims to achieve improvements in disk failure prediction models, when using these additional metrics. If this is indeed true for our use cases as well, then ceph should also collect these metrics from their users as a part of ceph-telemetry, so that we can build better models.
In this epic, we will explore this FAST dataset and evaluate the tradeoffs between performance gain and overhead of collecting these metrics from users. This would help us determine the optimal set of additional features that ceph should collect from users, to get the maximum benefit (in terms of better disk health prediction models).
As per the user story laid out by the ceph team here, their device health prediction module should have two models: one to forecast future SMART metrics for a device, and one to predict whether the device will fail.
We already created the latter type of model some time back and added it to upstream ceph (ceph/ceph#29437). Another PR (yaarith/ceph-telemetry#1) integrates this model to a public facing grafana dashboard, so that end users and SMEs can directly see model predictions and provide feedback.
However, we don't have any models for SMART metric forecasting yet.
Acceptance criteria:
Note: copying this issue from internal repo (aicoe-aiops/ceph-data-drive-failure#6) since issues can't be transferred from private to public repos
I was using this great end-to-end example, but ran into some issues at the last MLflow logging part.
NameError: name 'rfe_dt_preds' is not defined. How should rfe_dt_preds be defined?
https://github.com/chauhankaranraj/ceph_drive_failure/blob/master/kaggle_hgst_end2end.ipynb
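A likely fix, assuming `rfe_dt_preds` is meant to hold the predictions of the decision tree trained on RFE-selected features. All variable and dataset names below are stand-ins for illustration, not the notebook's actual code (the real notebook uses the Kaggle HGST data):

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.tree import DecisionTreeClassifier

# Toy stand-in for the notebook's data
X, y = make_classification(n_samples=200, n_features=10, random_state=0)

# Recursive feature elimination with a decision tree, keeping 5 features
rfe = RFE(DecisionTreeClassifier(random_state=0), n_features_to_select=5)
rfe.fit(X, y)

# Decision tree refit on the RFE-selected feature subset
dt_clf = DecisionTreeClassifier(random_state=0).fit(rfe.transform(X), y)

# The missing variable: predictions of the RFE + decision tree pipeline,
# which can then be passed to the MLflow logging step
rfe_dt_preds = dt_clf.predict(rfe.transform(X))
```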
As a data scientist working on disk health prediction for ceph,
I want access to disk health data from actual users’ ceph clusters,
So that I can train models using this data, which would likely result in more accurate and precise models than those created with external data.
As a Ceph product manager,
I want to anonymize and publish ceph telemetry data,
So that I can build an open source data science community around this data.
Totally open to ideas on how to approach this. One idea for an MVP: manually extract and store one quarter of data (in a similar fashion as Backblaze) in an s3 bucket on the operate-first cluster, and add a small jupyter notebook that shows how to access this data and does some light EDA on it. If this seems like a suitable way to share data, we can then automate the data publishing process.
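As a sketch of what the access notebook could show (the bucket name and object-key layout below are pure assumptions modeled on Backblaze's quarterly naming, since the actual layout is yet to be decided):

```python
def quarter_key(year, quarter):
    # Hypothetical object-key layout, mirroring Backblaze's quarterly
    # naming (e.g. data_Q1_2021); the real layout is TBD
    return f"ceph-telemetry/data_Q{quarter}_{year}.csv.gz"

def load_quarter(bucket, year, quarter):
    """Download one quarter of telemetry data from an s3 bucket.

    The bucket name and key layout are assumptions for illustration.
    """
    import boto3  # deferred so the sketch imports cleanly without boto3

    s3 = boto3.client("s3")
    obj = s3.get_object(Bucket=bucket, Key=quarter_key(year, quarter))
    return obj["Body"].read()
```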
Acceptance Criteria
As a data scientist, I want to make sure all projects follow the same data saving conventions. Currently this repo does not follow the data directory structure laid out in the project-template repo. This needs to be fixed.
Acceptance criteria
.gitkeep added to repo
As a data scientist,
I want to access and explore Ceph telemetry data,
So that I can use it for creating more accurate ML models for failure prediction and forecasting.
Once the initial dataset is available via #32, we can create a notebook to show how an external user can access it, and perform some initial EDA. This notebook can then be published to a blog (e.g. operate-first).
Acceptance criteria
Feedback no. 3
One salient feature of the Backblaze dataset is that the distribution of vendors in the data is neither uniform nor exhaustive. For example, seagate comprises ~70% of data, HGST comprises ~15%, Intel drive data is absent, etc. Also, our initial assumption was that SMART metrics may behave differently for different vendors. Therefore in the current forecasting notebook, models are trained vendor-wise. However, the distribution of vendors across Ceph users is likely different and we want to support all of those vendors.
As a data scientist, I want to explore how "transferable" forecasting models are, across vendors. That is, how is performance affected when a model is trained on data from one vendor and evaluated on data from another one.
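One way to set up the transferability experiment, sketched on synthetic stand-in data (the real experiment would use per-vendor Backblaze series and the notebook's actual forecasting models):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

def make_lag_features(series, n_lags=6):
    """Turn a 1-d series into (lag-window, next-value) supervised pairs."""
    X = np.array([series[i : i + n_lags] for i in range(len(series) - n_lags)])
    y = series[n_lags:]
    return X, y

rng = np.random.default_rng(0)
# Synthetic stand-ins for per-vendor SMART series
vendor_series = {
    "seagate": 100 - np.cumsum(rng.random(500) * 0.1),
    "hgst": 100 - np.cumsum(rng.random(500) * 0.05),
}

# Train a forecaster on each vendor, evaluate it on every vendor:
# the off-diagonal (train != test) cells measure transferability
for train_vendor, train_series in vendor_series.items():
    X_tr, y_tr = make_lag_features(train_series)
    model = LinearRegression().fit(X_tr, y_tr)
    for test_vendor, test_series in vendor_series.items():
        X_te, y_te = make_lag_features(test_series)
        mse = mean_squared_error(y_te, model.predict(X_te))
        print(f"train={train_vendor} test={test_vendor} mse={mse:.4f}")
```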
Acceptance criteria:
As a data scientist,
I want to establish the API for the disk health prediction python module to be created by #27,
So that I can design the python module according to what kind of inputs to expect and what kind of output to return.
As a ceph developer,
I want to establish the interface between the ceph manager daemon and the disk health prediction python module (to be created by #27),
So that I can ensure ceph manager is able to provide the inputs needed by this module in the correct format, and is able to use the output provided by it.
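A strawman for what such an interface could look like (the function name, input schema, return fields, and placeholder logic below are all assumptions to seed the discussion, not an agreed API):

```python
from typing import Dict, List

def predict_device_health(smart_readings: List[Dict[str, float]]) -> Dict[str, object]:
    """Hypothetical module entry point: takes the last N days of raw
    SMART attribute readings for one device (one dict per day, as the
    ceph manager daemon could pass them), returns a life-expectancy
    verdict plus a confidence score.
    """
    # Placeholder logic only: a real implementation would preprocess,
    # select features, and run the trained model
    latest = smart_readings[-1] if smart_readings else {}
    failing = latest.get("smart_5_raw", 0) > 0
    return {
        "life_expectancy": "<6 weeks" if failing else ">8 weeks",
        "confidence": 0.5,
    }
```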
Acceptance Criteria
docs directory
Feedback no. 1
In the current exploratory notebook for forecasting, the metric “reallocated sector count” (SMART 5) is being forecasted, and models are evaluated based on their predictive power on this metric. However, there are some other metrics that are also indicative of drive failure. For seagate drives these SMART metrics are: 187, 188, 197, and 198.
As a data scientist, I want to train and evaluate models used in the current notebook on these other metrics, to get a better understanding of model performances across different metrics.
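A minimal sketch of evaluating the same forecasting setup across all of these metrics rather than SMART 5 alone (synthetic data and a naive persistence baseline stand in for the real data and models):

```python
import numpy as np

def persistence_mse(series, horizon=1):
    """MSE of the naive 'last value carries forward' forecast,
    a baseline any trained model should beat."""
    preds, actual = series[:-horizon], series[horizon:]
    return float(np.mean((actual - preds) ** 2))

rng = np.random.default_rng(1)
# Synthetic stand-ins for the failure-indicative seagate metrics
metrics = {m: np.cumsum(rng.normal(size=300)) for m in (5, 187, 188, 197, 198)}

# Evaluate the same baseline on every metric, not just SMART 5
for smart_id, series in metrics.items():
    print(f"SMART {smart_id}: persistence MSE = {persistence_mse(series):.3f}")
```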
Acceptance criteria
As a developer/maintainer of machine learning models in ceph, I want to understand the python-package-related issues and constraints in integrating these models into ceph.
Acceptance Criterion
As a data scientist, I want to ensure all the code in this repo is well formatted and follows best software quality practices. Since this repo was created way before we had aicoe-aiops/project-template set up, many of the notebooks and scripts have gotten merged into this repo without being properly linted or formatted. This needs to be fixed.
Acceptance Criterion
As a data scientist,
I want to create a public python module to run inference using the models trained in this project,
So that these disk health prediction models are accessible and usable by a wider audience than just ceph users.
As a ceph product manager,
I want to use a separate, standalone python module for disk health prediction,
So that data preprocessing, feature selection, model selection, and other steps of the ML pipeline are decoupled from the ceph codebase. The ceph manager daemon should be able to just pass raw data to this module, and get health predictions in return.
Acceptance Criteria
As a data scientist,
I want to perform an initial EDA on the FAST dataset,
So that I understand the salient features of this dataset, and see how it compares to backblaze and ceph-telemetry datasets that I have worked with before. This should give me some idea for how to go about performing classification and regression tasks with this dataset.
This story is a part of epic #43
Acceptance Criteria
As a data scientist,
I want to run the previous forecasting experiments on the new FAST dataset,
So that I can
This story is a part of epic #43
Acceptance Criteria
Feedback no. 4
In the current forecasting notebook, we assumed that the maximum number of days of data we are guaranteed to have at runtime is 6. However, after talking to ceph subject matter experts, it seems there might be some flexibility there.
On the one hand, having more data available might improve model accuracy. On the other hand, this would mean users have to store more health data locally. The main purpose of this issue is to figure out the “sweet spot”: storing as little data as possible while still improving model performance.
As a data scientist, I want to explore how model performance changes with number of days of data available at runtime, to find a reasonable compromise between amount of data stored and model accuracy achieved.
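The sweep could be set up roughly like this (a synthetic series and a simple linear forecaster stand in for the real data and models):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error

rng = np.random.default_rng(2)
# Synthetic stand-in for a slowly degrading SMART series
series = 100 - np.cumsum(np.abs(rng.normal(size=1000)) * 0.01)

def lagged(series, n_days):
    """Build (last n_days of history, next value) supervised pairs."""
    X = np.array([series[i : i + n_days] for i in range(len(series) - n_days)])
    return X, series[n_days:]

# Sweep the number of days of history available at runtime and see
# how test error changes as more data is stored
for n_days in (3, 6, 12, 24):
    X, y = lagged(series, n_days)
    split = int(0.8 * len(X))
    model = LinearRegression().fit(X[:split], y[:split])
    mae = mean_absolute_error(y[split:], model.predict(X[split:]))
    print(f"{n_days:2d} days of history -> test MAE {mae:.4f}")
```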
Acceptance criteria:
Feedback no. 2
The current forecasting notebook has a univariate time series forecasting setup. That is, only one SMART metric is forecast at a time, independently of the other metrics. However, it is likely that there are interactions / dependencies across SMART metrics.
As a data scientist, I want to explore training multivariate time series forecasting models, which can capture interactions between various SMART metrics and possibly improve forecasting power.
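A minimal multivariate sketch: jointly forecasting all metrics at step t from all metrics at step t-1, i.e. a VAR(1)-style model fit with plain multi-output linear regression (synthetic coupled series stand in for real SMART data):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(3)
n, k = 400, 3  # 3 interacting SMART metrics, synthetic stand-ins
data = np.zeros((n, k))
for t in range(1, n):
    # Coupled random walks: each metric is nudged by the others,
    # so there is genuine cross-metric signal to capture
    data[t] = data[t - 1] + 0.01 * data[t - 1, [1, 2, 0]] + rng.normal(scale=0.1, size=k)

# VAR(1)-style model: all metrics at time t regressed jointly on
# all metrics at time t-1 (multi-output linear regression)
X, Y = data[:-1], data[1:]
model = LinearRegression().fit(X, Y)

# One-step-ahead forecast for every metric at once
forecast = model.predict(data[-1:])
print("next-step forecast for all metrics:", forecast.ravel())
```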
Acceptance criteria
As a Ceph product manager,
I want a publicly accessible long term storage space,
So that I can regularly upload Ceph telemetry data there and external users can access it easily.
As per the discussion in operate-first/support#198, we requested an account on MOC. We now need to figure out how to set up an s3 bucket associated with this account. Since we have not worked with OpenStack before, we will need to open a ticket and start a thread with MOC folks on how to do this.
Acceptance Criteria