cotede's Introduction

CoTeDe

CoTeDe is an open source Python package for quality control (QC) of oceanographic data such as temperature and salinity. It was designed to serve individual scientists as well as real-time operations at large data centers. To achieve that, CoTeDe is highly customizable, giving the user full control to compose the desired set of tests, including the specific parameters of each test, or to choose from a list of preset QC procedures.

I believe that we can do better than we have been doing by using more flexible classification techniques, including machine learning. My goal is to minimize the burden of manual expert QC while improving the consistency, performance, and reliability of the QC procedure for oceanographic data, especially for real-time operations.

CoTeDe is the result of several generations of quality control systems that started in 2006 with real-time QC of TSGs and were later expanded to other platforms, including CTDs, XBTs, gliders, and others.

Why CoTeDe

CoTeDe contains several QC procedures that can be easily combined in different ways:

  • Pre-set standard tests according to the recommendations by GTSPP, EGOOS, XBT, Argo or QARTOD;
  • Custom set of tests, including user defined thresholds;
  • Two different fuzzy logic approaches: as proposed by Timms et al. 2011 & Morello et al. 2014, and using the usual defuzzification by the bisector;
  • A novel approach based on Anomaly Detection, described by Castelao 2021 (available since 2014 http://arxiv.org/abs/1503.02714).

Each measuring platform is a different realm with its own procedures, metadata, and meaningful visualization. So CoTeDe focuses on providing a robust framework with the procedures and lets each application, and the user, decide how to drive the QC. For instance, pySeabird is another package that understands CTD data and uses CoTeDe as a plugin for QC.

Documentation

Detailed documentation is available at http://cotede.readthedocs.org, and a collection of notebooks with examples is available at http://nbviewer.ipython.org/github/castelao/CoTeDe/tree/master/docs/notebooks/

Citation

If you use CoTeDe, or replicate part of it, in your work/package, please consider including the reference:

Castelão, G. P. (2020). A Framework to Quality Control Oceanographic Data. Journal of Open Source Software, 5(48), 2063, https://doi.org/10.21105/joss.02063

@article{Castelao2020,
  doi = {10.21105/joss.02063},
  url = {https://doi.org/10.21105/joss.02063},
  year = {2020},
  publisher = {The Open Journal},
  volume = {5},
  number = {48},
  pages = {2063},
  author = {Guilherme P. Castelao},
  title = {A Framework to Quality Control Oceanographic Data},
  journal = {Journal of Open Source Software}
}

For the Anomaly Detection technique specifically, which was implemented in CoTeDe, please include the reference:

Castelão, G. P. (2021). A Machine Learning Approach to Quality Control Oceanographic Data. Computers & Geosciences, https://doi.org/10.1016/j.cageo.2021.104803

@article{Castelao2021,
  doi = {10.1016/j.cageo.2021.104803},
  url = {https://doi.org/10.1016/j.cageo.2021.104803},
  year = {2021},
  publisher = {Elsevier},
  author = {Guilherme P. Castelao},
  title = {A Machine Learning Approach to Quality Control Oceanographic Data},
  journal = {Computers and Geosciences}
}

If you are concerned about reproducibility, please include the DOI provided by Zenodo at the top of this page, which is associated with a specific release (version).

cotede's People

Contributors

bkatiemills, castelao, kthyng, s-good

cotede's Issues

Error in fuzzylogic notebook

This is probably abandoned, but there is a mistake in fuzzy_logic.ipynb where the fuzzy functions are generated.

For each line where there is something like:
data['spike_lo'] = fuzz.zmf(data['x_spike'], cfg['fuzzylogic']['features']['spike']['low'])

it should be:
data['spike_lo'] = fuzz.zmf(data['x_spike'], cfg['fuzzylogic']['features']['spike']['low']['params'])

That way the parameters of the z-membership function are loaded properly. The author probably modified load_cfg and forgot to update the notebook.
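The difference between the two lookups can be illustrated with a minimal, self-contained sketch. The `zmf` below is a plain NumPy version of a z-shaped membership function written for illustration only; it is not the function used in the notebook, and the nested `cfg` layout and its numeric values are assumptions based on the keys quoted above.

```python
import numpy as np

def zmf(x, params):
    """Z-shaped membership: 1 below a, 0 above b, smooth in between."""
    a, b = params
    x = np.asarray(x, dtype=float)
    mid = (a + b) / 2.0
    y = np.ones_like(x)
    upper = (x > a) & (x <= mid)
    y[upper] = 1 - 2 * ((x[upper] - a) / (b - a)) ** 2
    lower = (x > mid) & (x < b)
    y[lower] = 2 * ((x[lower] - b) / (b - a)) ** 2
    y[x >= b] = 0.0
    return y

# Hypothetical configuration: each fuzzy set is a dict, so the numbers
# live one level deeper, under "params".
cfg = {"fuzzylogic": {"features": {"spike": {
    "low": {"type": "zmf", "params": [0.0, 1.0]}}}}}

x_spike = np.array([0.0, 0.5, 1.0])
low = cfg["fuzzylogic"]["features"]["spike"]["low"]
spike_lo = zmf(x_spike, low["params"])  # passing `low` itself would fail
```

Passing the whole `low` dict instead of `low["params"]` cannot be unpacked into the two shape parameters, which is the kind of failure the issue describes.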

Flag 9 is not being set

Flag 9 is used for unavailable or NaN values.

It is not being set due to a misuse of ~np.isfinite() on a MaskedArray.
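One way to avoid depending on how a MaskedArray propagates its mask through np.isfinite() is to look at the mask and the underlying data separately. The sketch below is only an illustration under that assumption; `flag_unavailable` is a hypothetical helper, not a CoTeDe function.

```python
import numpy as np

def flag_unavailable(values, flags, missing_flag=9):
    """Return a copy of flags with missing_flag wherever values are masked or not finite."""
    mask = np.ma.getmaskarray(values)   # always a full boolean array
    data = np.ma.getdata(values)        # underlying data, mask ignored
    unavailable = mask | ~np.isfinite(data)
    flags = np.array(flags, copy=True)
    flags[unavailable] = missing_flag
    return flags

values = np.ma.masked_array([1.0, np.nan, 3.0, 4.0],
                            mask=[False, False, True, False])
flags = flag_unavailable(values, [1, 1, 1, 1])
```

Both the NaN and the masked element end up with flag 9, regardless of what the masked position holds in the data buffer.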

Double weight on first feature of Anomaly Detection

The first feature evaluated in Anomaly Detection was being considered twice.

Although it's a conceptual error since I'm not using weights at this point, it shouldn't compromise the classification results.

Problems following basic usage - logger issue

Wanted to try out your algorithms and see if they can be applied to our data... installed the package per instructions and got the following error

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-43-02ba7dda9ecb> in <module>
----> 1 pqc = fProfileQC('dPIRX003.cnv')

~/anaconda2/envs/cotede/lib/python3.7/site-packages/cotede/qc.py in __init__(self, inputfile, cfg, saveauxiliary, verbose, logger)
    467             # Not the best way, but will work for now. I should pass
    468             #   the reference for the logger being used.
--> 469             input = cnv.fCNV(inputfile, logger=None)
    470         except CNVError as e:
    471             #self.attributes['filename'] = basename(inputfile)

TypeError: __init__() got an unexpected keyword argument 'logger'

Any guidance? I've tried it on python 3.7, and 2.7 with the same error.
Thanks

valid_date instead of valid_datetime

All preset QC configuration files were using 'valid_date' key while the QC engine was searching for 'valid_datetime'.

I'll use datetime instead of date to make it clear that the time is included, when available.
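A key mismatch like this is silent: the engine simply never finds the test. A minimal sketch of a guard follows, assuming the configuration is a dict keyed by test name. The `KNOWN_TESTS` set and the `unknown_tests` helper are hypothetical and only list names mentioned on this page.

```python
# Hypothetical set of test names the engine looks for.
KNOWN_TESTS = {"valid_datetime", "location_at_sea", "woa_comparison", "tukey53H_norm"}

def unknown_tests(cfg):
    """Return the config keys that no known test would pick up."""
    return sorted(set(cfg) - KNOWN_TESTS)

# A preset still using the old key is reported instead of being silently skipped.
leftover = unknown_tests({"valid_date": None, "location_at_sea": None})
```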

woa_comparison as float64

ProfileQCCollection.flags is returning woa_comparison as a float64, but it was supposed to be an integer.
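The issue does not say why the dtype changed. One common cause of integer flags turning into float64 is a NaN placeholder for missing values, since NaN cannot be stored in an integer array. Under that assumption, a minimal sketch of a conversion back to integers, using flag 9 for unavailable values as described elsewhere on this page (`as_integer_flags` is a hypothetical helper):

```python
import numpy as np

def as_integer_flags(flags, missing_flag=9):
    """Replace NaN (or masked) flags with missing_flag and return int8 flags."""
    filled = np.ma.masked_invalid(flags).filled(missing_flag)
    return filled.astype("int8")

woa_flags = as_integer_flags([1.0, np.nan, 4.0])
```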

Incorrect sub-sampling by split_data_groups()

Unavailable data, flagged 9 or masked, was being treated as bad data by split_data_groups(), while it should be ignored by the adjusting procedure that defines the parameters for the Anomaly Detection.

split_data_groups() randomly sub-samples the data into fit, test and err groups for the anomaly detection procedure.
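The intended behavior can be sketched as follows. This is not the actual split_data_groups(); the function name, its signature, the choice of flags 1 and 2 as good and 3 and 4 as bad, and the reading of "fit" and "test" as a random partition of the good samples with "err" holding the bad ones are all assumptions made for illustration.

```python
import numpy as np

def split_groups_sketch(flags, fit_fraction=0.5, seed=0):
    """Boolean index arrays for fit/test/err, skipping unavailable samples."""
    flags = np.ma.asarray(flags)
    values = np.ma.getdata(flags)
    # Masked samples and samples flagged 9 belong to no group at all.
    available = ~np.ma.getmaskarray(flags) & (values != 9)
    good = available & np.isin(values, [1, 2])
    bad = available & np.isin(values, [3, 4])
    rng = np.random.default_rng(seed)
    pick = rng.random(flags.shape) < fit_fraction
    return {"fit": good & pick, "test": good & ~pick, "err": bad}

flags = np.ma.masked_array([1, 1, 4, 9, 3, 2],
                           mask=[False, True, False, False, False, False])
groups = split_groups_sketch(flags)
```

The masked sample and the one flagged 9 appear in none of the three groups, so they cannot distort the fitted parameters.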

Fuzzylogic and Morello2014 tests

Hi @castelao, I noticed you have two config files for fuzzy logic tests - fuzzylogic.json and morello2014.json. The settings look very similar in both. For AutoQC should I consider these as separate tests or would you say that they are too similar to bother?

nan in gsw output

The density_invertion test uses gsw, which returns NaN for invalid inputs. When the result is compared with the threshold, NumPy emits an annoying message:
RuntimeWarning: invalid value encountered in greater_equal
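One possible way to avoid the warning is to compare only the finite values and keep track of which samples could not be evaluated. The sketch below is generic NumPy, not the code of the density test; `exceeds_threshold` is a hypothetical helper.

```python
import numpy as np

def exceeds_threshold(values, threshold):
    """Compare only finite values; return (result, evaluated) boolean arrays."""
    values = np.asarray(values, dtype=float)
    evaluated = np.isfinite(values)
    result = np.zeros(values.shape, dtype=bool)
    result[evaluated] = values[evaluated] >= threshold
    return result, evaluated

result, evaluated = exceeds_threshold([0.1, np.nan, -0.5], 0.0)
```

The `evaluated` array lets the caller assign a separate flag to samples where the comparison was not possible, instead of treating them as passing or failing.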

Remove DAP dependency

It would be nice not to depend on PyDAP by default, but to let the user know what can't be done without it.
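A common pattern for an optional dependency is to check for it only when the feature that needs it is requested, and to fail with a message that names the feature. A minimal sketch using only the standard library follows; the helper name and the message wording are hypothetical.

```python
import importlib.util

def require_optional(module_name, feature):
    """Raise a descriptive ImportError if an optional module is missing."""
    if importlib.util.find_spec(module_name) is None:
        raise ImportError(
            f"{feature} requires the optional package '{module_name}', "
            "which is not installed."
        )

# Would be called at the start of a feature that needs remote access, e.g.
# require_optional("pydap", "Remote climatology access")
```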

Create a plan to extend tests for Chlorophyll

CoTeDe currently operates with temperature and salinity. What is needed to start evaluating chlorophyll fluorescence (fchl)?

I know of two good references for QC procedures. Do you know more?

Some tests like valid location already exist in CoTeDe.

  • Create a config file (QC descriptor) for the tests already available;
  • List the desired tests to include;

Avoid dependency on Scipy

It's only used for some interpolation, and it's easy to avoid. Although SciPy is quite a nice package, it makes no sense to require it just for 2D linear interpolation.
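As an illustration of how little is needed, here is a minimal sketch of 2D linear (bilinear) interpolation on a regular grid using only NumPy. It assumes strictly increasing grid coordinates and a `z` array of shape (len(ygrid), len(xgrid)); the function name is hypothetical and this is not the code used in CoTeDe.

```python
import numpy as np

def bilinear(x, y, xgrid, ygrid, z):
    """Bilinear interpolation of z[j, i] = f(ygrid[j], xgrid[i]) at (x, y)."""
    xgrid, ygrid, z = map(np.asarray, (xgrid, ygrid, z))
    # Index of the lower-left corner of the cell containing each point.
    i = np.clip(np.searchsorted(xgrid, x) - 1, 0, len(xgrid) - 2)
    j = np.clip(np.searchsorted(ygrid, y) - 1, 0, len(ygrid) - 2)
    tx = (x - xgrid[i]) / (xgrid[i + 1] - xgrid[i])
    ty = (y - ygrid[j]) / (ygrid[j + 1] - ygrid[j])
    return ((1 - tx) * (1 - ty) * z[j, i] + tx * (1 - ty) * z[j, i + 1]
            + (1 - tx) * ty * z[j + 1, i] + tx * ty * z[j + 1, i + 1])

xgrid = np.array([0.0, 1.0, 2.0])
ygrid = np.array([0.0, 1.0])
z = xgrid[np.newaxis, :] + 10 * ygrid[:, np.newaxis]
value = bilinear(0.5, 0.5, xgrid, ygrid, z)
```

Points outside the grid are extrapolated from the nearest cell in this sketch; a real implementation would probably mask them instead.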

Resolve flake8 failures and/or add flake8-ignore

The CONTRIBUTING docs recommend running flake8. However when I tried this, there were a lot of flake8 failures. I recommend resolving those and/or adding a list of ignored tests to flake8-ignore. Otherwise, as a contributor I'd just ignore the output of these tests.

i2b_flags fails with pandas.Series()

i2b_flags() fails when given a pandas.Series().

The solution should be generic and avoid requiring the pandas package, i.e. it should be able to handle a pd.Series() without explicitly checking whether the input is a pd.Series().

Incorrect top params fit

It's a minor bug. When defining the top n% of the samples to fit the exponweib.pdf(), it was using the total N, which included invalid data.

A sample can be valid but impossible to evaluate in a specific test, such as a place where the climatology was built with only one historical sample. In that case, I do not consider the climatology result, so there is a sample that could be flagged valid by other tests but does not have a climatology test result.
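The fix amounts to counting only the samples that actually have a result for the test. A minimal sketch follows, assuming "top n%" means the largest values of the feature; the helper name is hypothetical and the distribution fit itself is left out.

```python
import numpy as np

def top_fraction(values, fraction):
    """Largest `fraction` of the valid (finite, unmasked) samples."""
    valid = np.ma.masked_invalid(values).compressed()
    # The count is based on the valid samples, not on the full input size.
    n_top = max(1, int(round(fraction * valid.size)))
    return np.sort(valid)[-n_top:]

values = [1, 2, 3, 4, np.nan, np.nan, np.nan, np.nan, 5, 6, 7, 8, 9, 10]
top = top_fraction(values, 0.2)
```

With 10 valid samples out of 14, the top 20% is 2 samples; using the total N would have selected 3.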

GTSPP config for WOA test

Hi @castelao. Thanks for making this software available! I've been working through all the tests you have implemented and I noticed that the WOA_normbias config in gtspp.json includes the variable 't_an', which I think should be 't_mn'. Also, I'm not sure about this, but I think that the 'at_sea' test in the same file (and some of the other config files) may not work unless it is called 'location_at_sea'?

location_at_sea: one vs multiple positions

For TSG, each measurement has its own position, requiring multiple evaluations of the location_at_sea test.

Rethink the function itself, as well as the call on: common and evaluate.

Gradient test, flag 9 on first element

The gradient test frequently returns the first element flag as 9.

If there is no masked element in the profile, .mask returns a single boolean False. The proper way to do this is to use getmaskarray().
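The difference is easy to reproduce with NumPy alone. The profile values below are made up for illustration.

```python
import numpy as np

# A profile with no masked element at all.
profile = np.ma.masked_array([10.2, 10.1, 9.8])

scalar_mask = profile.mask                # a single False, not one per element
full_mask = np.ma.getmaskarray(profile)   # boolean array with the profile's shape

# Indexing element by element only works with the full mask.
first_is_masked = full_mask[0]
```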

flag setting in location_at_sea.py

Hi @castelao, sorry to raise another issue but I found that my code was failing if a profile location was on land. I think it might be because 'flag' is not set before it is used in location_at_sea.py. If I set flag = 3 it runs through as expected.

No flag if lat or lon are not available

If latitude or longitude is not available in .attributes, the test fails safely in an except block but does not define any flag. It's probably better to return a failing flag.
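A minimal sketch of a version that defines the flag on every path is shown below. It is not the code in location_at_sea.py: the function name, the attribute keys, the `elevation_at` callable (returning meters, negative below sea level), and the use of 1 for good and 3 for failing are all assumptions made for illustration.

```python
def location_at_sea_sketch(attributes, elevation_at, good_flag=1, bad_flag=3):
    """Return good_flag only if a valid position lies below sea level."""
    flag = bad_flag  # defined up front, so every path returns a flag
    lat = attributes.get("latitude")
    lon = attributes.get("longitude")
    if lat is None or lon is None:
        return flag
    # NaN fails these comparisons, so it also gets the failing flag.
    if not (-90 <= lat <= 90 and -180 <= lon <= 360):
        return flag
    if elevation_at(lat, lon) < 0:
        flag = good_flag
    return flag

# Stand-ins for a bathymetry lookup, used here only to exercise the sketch.
def always_sea(lat, lon):
    return -3000.0

def always_land(lat, lon):
    return 250.0

at_sea = location_at_sea_sketch({"latitude": 10.0, "longitude": -30.0}, always_sea)
no_position = location_at_sea_sketch({"latitude": 10.0}, always_sea)
```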

add alternative methods for calculating salinity

My packages (e.g. https://github.com/evanleeturner/sonde3) use the PyPI seawater package to calculate salinity: https://pypi.org/project/seawater/

Since the method for calculating salinity may differ among groups, and the choice also affects a group's written QAPP/QAQC, it would be an excellent feature of this codebase to be able to apply alternative methods for seawater conversion, instead of only the GSW package using the Thermodynamic Equation of Seawater 2010 (TEOS-10) method, without having to create specific forks of the CoTeDe package.

Threshold for tukey53H_norm in cotede configuration

Hi again @castelao, I just wanted to check what the threshold should be for the tukey53H_norm test in the cotede.json configuration? At the moment it looks like it is picking up the value of the threshold from the last test run (2). Is this the correct value to use?

Failing on descentPrate()

descentPrate() always fails.

It was only partially updated to the new pattern, which receives the whole data object, instead of the old pattern, which received directly and only the required variables.

Remove dependency on Pandas

Some functionality would still require Pandas, but it must be possible to install and use the core applications without it.
