sintel-dev / zephyr
https://dtail.gitbook.io/zephyr/
License: MIT License
predict_proba
FindThreshold primitive

I have been looking at the labelling tool. I believe we are losing many False classes because we are not setting `drop_empty` to False in the label maker `search` method. I think this would have a big impact if our monitoring records are abbreviated or pre-filtered (to some degree).
> `drop_empty` (bool) – Whether to drop empty slices. Default value is True.
I have noted that the labelling data slices are not well controlled; we should have clear guidance and/or controls over how the slices are generated.
I think most labels should overlap in some way to present more training examples for positive classes.
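The effect described above can be sketched in plain Python. This is illustrative only, not the actual label maker API; each slice is a list of event flags and the label is True when any failure event falls inside the slice:

```python
# Minimal sketch of the drop_empty behaviour; illustrative only, not the
# actual label maker API.
def label_slices(slices, drop_empty=True):
    labels = []
    for data_slice in slices:
        if len(data_slice) == 0 and drop_empty:
            # Empty (e.g. pre-filtered) slices are silently discarded,
            # even though they would have produced a False label.
            continue
        labels.append(any(data_slice))
    return labels

slices = [[1, 0], [], [0], []]
print(label_slices(slices, drop_empty=True))   # [True, False]
print(label_slices(slices, drop_empty=False))  # [True, False, False, False]
```

With `drop_empty=True`, the two empty slices never become negative examples, which is exactly how pre-filtered monitoring records could bias the label distribution.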
The current approach to changing hyperparameters is through a dictionary we pass to the Zephyr API. This dictionary specifies the primitive and the hyperparameter of that primitive that the user wishes to modify.

Example code for an `xgb` pipeline:
```python
hyperparameters = {
    "xgboost.XGBClassifier#1": {
        "n_estimators": 50
    }
}
zephyr = Zephyr('xgb', hyperparameters)
```
The proposed design is that hyperparameters can be exposed at the pipeline level without the need for a dictionary.

Example code that achieves the same change as the previous example:

```python
zephyr = Zephyr('xgb', n_estimators=50)
```
Under the hood, these hyperparameters should be mapped to the right primitive and altered.

There are several cases that need to be resolved with this strategy:

- hyperparameters that are set at `fit` time? e.g. `epochs`
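One way to sketch the mapping step is to assign each flat keyword argument to whichever primitive declares it. The function name and dictionary layout below are hypothetical, not Zephyr's implementation:

```python
# Hypothetical sketch of mapping flat keyword arguments onto per-primitive
# hyperparameter dictionaries; not Zephyr's actual implementation.
def map_hyperparameters(tunable, **kwargs):
    """Assign each flat kwarg to every primitive that declares it.

    ``tunable`` maps primitive names to the hyperparameters they accept,
    e.g. the tunable hyperparameters reported by the pipeline.
    """
    mapped = {}
    for primitive, params in tunable.items():
        for name, value in kwargs.items():
            if name in params:
                mapped.setdefault(primitive, {})[name] = value
    return mapped

tunable = {'xgboost.XGBClassifier#1': {'n_estimators': 100, 'max_depth': 3}}
print(map_hyperparameters(tunable, n_estimators=50))
# {'xgboost.XGBClassifier#1': {'n_estimators': 50}}
```

This also surfaces one of the cases to resolve: a hyperparameter name declared by more than one primitive would, in this sketch, be assigned to all of them.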
SigPro allows users to process time series signals and perform a wide range of transformations and aggregations. We want to allow users to use SigPro through Zephyr.
Assume we have `transformations` and `aggregations`; we want to apply them to `pidata` or `scada` data.

Suppose we have the following `pidata` dataframe:
| _index | timestamp | COD_ELEMENT | val1 | val2 |
|---|---|---|---|---|
| 0 | 2022-01-02 13:21:01 | 0 | 1002.0 | -98.7 |
| 1 | 2022-03-08 13:21:01 | 0 | 56.8 | 1004.2 |
If we want to compute the mean of the amplitude using the SigPro primitive `sigpro.aggregations.amplitude.statistical.mean` for each month of readings in the column `val1`, then we get the following processed dataframe:
| _index | time | COD_ELEMENT | mean |
|---|---|---|---|
| 0 | 2022-01-31 | 0 | 1002.0 |
| 1 | 2022-02-28 | 0 | null |
| 2 | 2022-03-31 | 0 | 56.8 |
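As a sanity check on the expected binning, the monthly mean above can be reproduced with plain pandas. This is a sketch of the behaviour, not the SigPro pipeline itself:

```python
import pandas as pd

# Sketch of the expected monthly binning using plain pandas, not SigPro.
pidata = pd.DataFrame({
    'timestamp': pd.to_datetime(['2022-01-02 13:21:01', '2022-03-08 13:21:01']),
    'COD_ELEMENT': [0, 0],
    'val1': [1002.0, 56.8],
})

processed = (
    pidata.set_index('timestamp')
          .groupby('COD_ELEMENT')['val1']
          .resample('M')          # one bin per calendar month
          .mean()
          .reset_index()
          .rename(columns={'timestamp': 'time', 'val1': 'mean'})
)
print(processed)
# February has no readings, so its mean is null (NaN).
```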
```python
def process_signals(es, signal_dataframe_name, signal_column, transformations,
                    aggregations, window_size, replace_dataframe=False, **kwargs):
    '''Process signals using SigPro.

    Apply SigPro transformations and aggregations on the specified entity from the
    given entityset. If ``replace_dataframe=True``, then the old entity will be updated.

    Args:
        es (featuretools.EntitySet):
            Entityset to extract signals from.
        signal_dataframe_name (str):
            Name of the dataframe in the entityset containing signal data to process.
        signal_column (str):
            Name of the column containing signal values to apply the signal
            processing pipeline to.
        transformations (list[dict]):
            List of dictionaries containing the transformation primitives.
        aggregations (list[dict]):
            List of dictionaries containing the aggregation primitives.
        window_size (str):
            Size of the window to bin the signals over, e.g. ``'1h'``.
        replace_dataframe (bool):
            If ``True``, will replace the entire signal dataframe in the EntitySet
            with the processed signals. Defaults to ``False``, creating a new child
            dataframe containing processed signals with the suffix ``_processed``.
    '''
```
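For reference, the `transformations` and `aggregations` arguments follow SigPro's primitive-specification format (a list of dictionaries naming each primitive). The aggregation path is the one given above; the identity transformation path is an assumption for illustration:

```python
# Example primitive specifications for the arguments above. The identity
# transformation path is assumed here purely for illustration.
transformations = [{
    'name': 'identity',
    'primitive': 'sigpro.transformations.amplitude.identity.identity',
}]
aggregations = [{
    'name': 'mean',
    'primitive': 'sigpro.aggregations.amplitude.statistical.mean',
}]

# Hypothetical call, assuming an entityset ``es`` with a ``pidata`` dataframe:
# process_signals(es, 'pidata', 'val1', transformations, aggregations, '1m')
```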
Currently the validation function in Zephyr raises an error when a loaded entityset does not conform to the expected metadata. For example, if the turbines entity is missing COD_ELEMENT, the following error will be displayed:

```
ValueError: Turbines index column "COD_ELEMENT" missing from data for turbines entity
```

Rather than erroring out, the validation function should display a full summary for each entity. This will look something like a unit-testing report:
```
Name             Pass   Fail   Cover
-------------------------------------
turbines           x      1     90%
work_orders        x      1     95%
notifications      x      1     92%
stoppages          ✓            100%
alarms             ✓            100%
pidata             ✓            100%
-------------------------------------
TOTAL              6      3     96%
```
- turbines: `ValueError: Turbines index column "COD_ELEMENT" missing from data for turbines entity`
- work_orders: `ValueError: Expected index column "COD_ORDER" of work_orders entity is not unique`
- notifications: `ValueError: Missing time index column "DAT_POSTING" from notifications entity`
To use Zephyr, all of the validations above must pass at 100%.
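A sketch of how the validator could aggregate failures instead of raising on the first one; the names here are illustrative, not Zephyr's API:

```python
# Illustrative sketch: run every entity check, collect failures, and report
# a summary instead of raising on the first ValueError. Not Zephyr's API.
def validate_entityset(checks):
    """``checks`` maps entity names to zero-argument check callables."""
    results = {}
    for entity, check in checks.items():
        try:
            check()
            results[entity] = None          # passed
        except ValueError as error:
            results[entity] = str(error)    # failed; keep the message
    return results

def check_turbines():
    raise ValueError('Turbines index column "COD_ELEMENT" missing '
                     'from data for turbines entity')

results = validate_entityset({'turbines': check_turbines, 'alarms': lambda: None})
for entity, error in results.items():
    print(f"{entity:<15} {'x' if error else '✓'}")
```

The collected messages could then be printed beneath the summary table, one entity per line, as in the mock-up above.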
In addition to `create_entityset_scada` and `create_entityset_pidata`, we want to add `create_entityset_vibrations`.

Entities to include:

Necessary columns in `vibrations`:

Metadata of `vibrations`:
```python
'vibrations': {
    'index': '_index',
    'make_index': True,
    'time_index': 'timestamp',
    'logical_types': {
        'COD_ELEMENT': 'categorical',
        'turbine_id': 'categorical',
        'signal_id': 'categorical',
        'timestamp': 'datetime',
        'sensorName': 'categorical',
        'sensorType': 'categorical',
        'sensorSerial': 'integer_nullable',
        'siteName': 'categorical',
        'turbineName': 'categorical',
        'turbineSerial': 'integer_nullable',
        'configurationName': 'natural_language',
        'softwareVersion': 'categorical',
        'rpm': 'double',
        'rpmStatus': 'natural_language',
        'duration': 'natural_language',
        'condition': 'categorical',
        'maskTime': 'datetime',
        'Mask Status': 'natural_language',
        'System Serial': 'categorical',
        'WPS-ActivePower-Average': 'double',
        'WPS-ActivePower-Minimum': 'double',
        'WPS-ActivePower-Maximum': 'double',
        'WPS-ActivePower-Deviation': 'double',
        'WPS-ActivePower-StartTime': 'datetime',
        'WPS-ActivePower-StopTime': 'datetime',
        'WPS-ActivePower-Counts': 'natural_language',
        'Measured RPM': 'double',
        'WPS-ActivePower': 'double',
        'WPS-Gearoiltemperature': 'double',
        'WPS-GeneratorRPM': 'double',
        'WPS-PitchReference': 'double',
        'WPS-RotorRPM': 'double',
        'WPS-Windspeed': 'double',
        'WPS-YawAngle': 'double',
        'overload warning': 'categorical',
        'bias warning': 'categorical',
        'bias voltage': 'double',
        'xValueOffset': 'double',
        'xValueDelta': 'double',
        'xValueUnit': 'categorical',
        'yValueUnit': 'categorical',
        'TotalCount-RPM0': 'double',
        'TotalCount-RPM1': 'double',
        'TotalCount-RPM2': 'double',
        'TotalCount-RPM3': 'double'
    }
}
```