derb12 / pybaselines

A Python library of algorithms for the baseline correction of experimental data.

Home Page: https://pybaselines.readthedocs.io

License: BSD 3-Clause "New" or "Revised" License

Python 100.00%
python materials-characterization materials-science baseline-removal background-removal baseline-correction spectroscopy chemistry ftir raman

pybaselines's Introduction

pybaselines

[Logo]

[Badges: PyPI version, conda version, GitHub Actions test status, documentation status, supported Python versions, Zenodo DOI]

pybaselines is a library of algorithms for the baseline correction of experimental data.


Introduction

pybaselines is a Python library that provides many different algorithms for performing baseline correction on data from experimental techniques such as Raman, FTIR, NMR, XRD, XRF, PIXE, etc. The aim of the project is to provide a semi-unified API that allows quickly testing and comparing multiple baseline correction algorithms to find the best one for a given set of data.

pybaselines has 50+ baseline correction algorithms. These include popular algorithms, such as AsLS, airPLS, ModPoly, and SNIP, as well as many lesser-known algorithms. Most algorithms are adapted directly from the literature, although a few are unique to pybaselines, such as penalized spline versions of Whittaker-smoothing-based algorithms. The full list of implemented algorithms can be found in the documentation.

Installation

Stable Release

pybaselines can be installed from PyPI using pip by running the following command in the terminal:

pip install pybaselines

pybaselines can alternatively be installed from the conda-forge channel using conda by running:

conda install -c conda-forge pybaselines

Development Version

The sources for pybaselines can be downloaded from the GitHub repo. To install the current version of pybaselines from GitHub, run:

pip install git+https://github.com/derb12/pybaselines.git#egg=pybaselines

Dependencies

pybaselines requires Python version 3.8 or later and the following libraries:

  • NumPy
  • SciPy

All of the required libraries should be automatically installed when installing pybaselines using any of the installation methods above.

The optional dependencies for pybaselines are listed in the documentation. To also install the optional dependencies when installing pybaselines with pip, run:

pip install pybaselines[full]

If installing with conda, the optional dependencies have to be specified manually.
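For example, to install pybaselines and its optional dependencies with conda (a sketch, assuming the optional dependencies pentapy and numba, both available on conda-forge; check the documentation for the current list):

conda install -c conda-forge pybaselines pentapy numba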

Quick Start

To use the various functions in pybaselines, simply input the measured data and any required parameters. All baseline correction functions in pybaselines output two items: a numpy array of the calculated baseline and a dictionary of potentially useful parameters. The main interface for all baseline correction algorithms in pybaselines is the Baseline object for one-dimensional data and the Baseline2D object for two-dimensional data.

For more details on each baseline algorithm, refer to the algorithms section of pybaselines's documentation. For examples of their usage, refer to the examples section.

A simple example is shown below.

import matplotlib.pyplot as plt
import numpy as np
from pybaselines import Baseline, utils

x = np.linspace(1, 1000, 1000)
# a measured signal containing several Gaussian peaks
signal = (
    utils.gaussian(x, 4, 120, 5)
    + utils.gaussian(x, 5, 220, 12)
    + utils.gaussian(x, 5, 350, 10)
    + utils.gaussian(x, 7, 400, 8)
    + utils.gaussian(x, 4, 550, 6)
    + utils.gaussian(x, 5, 680, 14)
    + utils.gaussian(x, 4, 750, 12)
    + utils.gaussian(x, 5, 880, 8)
)
# exponentially decaying baseline
true_baseline = 2 + 10 * np.exp(-x / 400)
noise = np.random.default_rng(1).normal(0, 0.2, x.size)

y = signal + true_baseline + noise

baseline_fitter = Baseline(x_data=x)

bkg_1, params_1 = baseline_fitter.modpoly(y, poly_order=3)
bkg_2, params_2 = baseline_fitter.asls(y, lam=1e7, p=0.02)
bkg_3, params_3 = baseline_fitter.mor(y, half_window=30)
bkg_4, params_4 = baseline_fitter.snip(
    y, max_half_window=40, decreasing=True, smooth_half_window=3
)

plt.plot(x, y, label='raw data', lw=1.5)
plt.plot(x, true_baseline, lw=3, label='true baseline')
plt.plot(x, bkg_1, '--', label='modpoly')
plt.plot(x, bkg_2, '--', label='asls')
plt.plot(x, bkg_3, '--', label='mor')
plt.plot(x, bkg_4, '--', label='snip')

plt.legend()
plt.show()

The above code will produce the image shown below.

[Image: various baselines]
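The two-dimensional interface is analogous. Below is a minimal sketch, assuming Baseline2D's asls method mirrors the one-dimensional signature (refer to the documentation for the exact API):

import numpy as np
from pybaselines import Baseline2D

x = np.linspace(0, 100, 50)
z = np.linspace(0, 100, 60)
X, Z = np.meshgrid(x, z, indexing='ij')
# hypothetical two-dimensional data: a broad Gaussian background
data = 5 * np.exp(-((X - 50)**2 + (Z - 50)**2) / 1000)

fitter_2d = Baseline2D(x_data=x, z_data=z)
bkg_2d, params_2d = fitter_2d.asls(data, lam=1e3)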

Contributing

Contributions are welcome and greatly appreciated. For information on submitting bug reports, pull requests, or general feedback, please refer to the contributing guide.

Changelog

Refer to the changelog for information on pybaselines's changes.

License

pybaselines is open source and freely available under the BSD 3-clause license. For more information, refer to the license.

Citing

If you use pybaselines for published research, please consider citing by following the guidelines in the documentation.

Author

pybaselines's People

Contributors

derb12


pybaselines's Issues

JOSS paper?

In the citing section of the docs, you list a BibTeX entry for the GitHub repo along with a Zenodo DOI. In addition to this (superseding it, perhaps), I'd suggest writing a JOSS paper: https://joss.theoj.org/

The process is quite easy and I think the work in this package well exceeds the baseline requirements for publication there.

On a personal note, I really appreciated the depth of explanation in the algorithms section of the docs, and also the level of optimization (e.g. using pentapy).

Rubberband correction

Description of the problem/new feature

Like in https://stats.stackexchange.com/questions/240679, I can't find a specific paper of origin for the method, but it's not an uncommon one and should probably be included.

Description of a possible solution or alternative

Here is my version of the implementation:

import numpy as np
from scipy.spatial import ConvexHull

def rb_baseline(x, y):
    # indices of the convex hull vertices, in counterclockwise order
    v = ConvexHull(np.column_stack((x, y))).vertices
    # rotate so the leftmost point (index 0 for ascending x) comes first
    v = np.roll(v, -v.argmin())
    # keep only the lower hull: leftmost point through the rightmost point
    v = v[:v.argmax() + 1]
    # linearly interpolate between the lower-hull vertices
    return np.interp(x, x[v], y[v])

It is very similar to the one in https://dsp.stackexchange.com/questions/2725, but with some fixes.
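A quick usage sketch (hypothetical data; note that the implementation above assumes x is sorted in ascending order):

x = np.linspace(0, 10, 500)
# curved baseline plus a single Gaussian peak
y = np.exp(-x / 5) + np.exp(-(x - 5)**2)
baseline = rb_baseline(x, y)
corrected = y - baseline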

`collab_pls` does not iteratively update the baseline

Description

collab_pls does not iteratively update the baseline. Also, the docstring for method_kwargs is from v0.7, right? The ** does not apply anymore since method_kwargs should be a dict.

Your Setup

Python 3.8.10 (default, Nov 26 2021, 20:14:08) [GCC 9.3.0]
pybaselines 0.8.0

Example Code

See my comments below. To my understanding, setting method_kwargs['tol'] = np.inf leads to the Whittaker optimization terminating after one iteration, regardless of the value of the loss function, because (for example, in aspls line 730):

if calc_difference < tol:
    break

will always lead to early stopping, because any finite value is < np.inf. It seems to me that removing the method_kwargs['tol'] = np.inf line will lead to proper optimization for the component spectra used in the collaboration.
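A minimal reproduction sketch of the reported behavior (hypothetical data; the length of each tol_history entry shows how many iterations each per-spectrum fit ran):

import numpy as np
from pybaselines.optimizers import collab_pls

# hypothetical dataset: 5 noisy spectra of 200 points each
dataset = np.random.default_rng(0).normal(1, 0.1, (5, 200))
baselines, params = collab_pls(dataset, method='aspls')
# each array is 1 entry long, i.e. every per-spectrum fit stops after one iteration
print([len(tols) for tols in params['tol_history']])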

def collab_pls(data, average_dataset=True, method='asls', method_kwargs=None, **kwargs):
    """
    Collaborative Penalized Least Squares (collab-PLS).

    Averages the data or the fit weights for an entire dataset to get more
    optimal results. Uses any Whittaker-smoothing-based or weighted spline algorithm.

    Parameters
    ----------
    data : array-like, shape (M, N)
        An array with shape (M, N) where M is the number of entries in
        the dataset and N is the number of data points in each entry.
    average_dataset : bool, optional
        If True (default) will average the dataset before fitting to get the
        weighting. If False, will fit each individual entry in the dataset and
        then average the weights to get the weighting for the dataset.
    method : str, optional
        A string indicating the Whittaker-smoothing-based or weighted spline method to
        use for fitting the baseline. Default is 'asls'.
    **method_kwargs
        Keyword arguments to pass to the selected `method` function.

    Returns
    -------
    baselines : np.ndarray, shape (M, N)
        An array of all of the baselines.
    params : dict
        A dictionary with the following items:

        * 'average_weights': numpy.ndarray, shape (N,)
            The weight array used to fit all of the baselines.

        Additional items depend on the output of the selected method. Every
        other key will have a list of values, with each item corresponding to a
        fit.

    References
    ----------
    Chen, L., et al. Collaborative Penalized Least Squares for Background
    Correction of Multiple Raman Spectra. Journal of Analytical Methods
    in Chemistry, 2018, 2018.

    """
    dataset, fit_func, _, method_kws = _setup_optimizer(
        data, method, (whittaker, morphological, classification, spline), method_kwargs,
        True, **kwargs
    )
    if dataset.ndim < 2:
        raise ValueError((
            'the input data must have a shape of (number of measurements, number of points), '
            f'but instead has a shape of {dataset.shape}'
        ))
    if average_dataset:
        _, fit_params = fit_func(np.mean(dataset.T, 1), **method_kws)
        method_kws['weights'] = fit_params['weights']
    else:
        weights = np.empty_like(dataset)
        for i, entry in enumerate(dataset):
            _, fit_params = fit_func(entry, **method_kws)
            weights[i] = fit_params['weights']
        method_kws['weights'] = np.mean(weights.T, 1)
        
    # Why is `tol` being set to `np.inf`?
    # With `tol=np.inf`, the Whittaker algorithms only go through one optimization
    # iteration. This can be verified by inspecting the `params` output of this
    # function: after fitting with `collab_pls`, `tol_history` is a list of
    # numpy arrays, with one array per spectrum used in the collaboration and each
    # array as long as the number of optimization iterations. When `collab_pls` is
    # used, each of the arrays in the list is 1 entry long. In contrast, the
    # `params` output of a non-collab `aspls` call, for example, will contain a
    # numpy array of decreasing loss values.
    method_kws['tol'] = np.inf
    baselines = np.empty(dataset.shape)
    params = {'average_weights': method_kws['weights']}
    method = method.lower()
    if method == 'fabc':
        # have to handle differently since weights for fabc is the mask for
        # classification rather than weights for fitting
        fit_func = _whittaker_smooth
        for key in list(method_kws.keys()):
            if key not in {'weights', 'lam', 'diff_order'}:
                method_kws.pop(key)

    for i, entry in enumerate(dataset):
        baselines[i], param = fit_func(entry, **method_kws)
        if method == 'fabc':
            param = {'weights': param}
        for key, value in param.items():
            if key in params:
                params[key].append(value)
            else:
                params[key] = [value]

    return baselines, params

Error Message

No errors

Switch to using classes

Description of the problem/new feature

Switch to using classes for all of the algorithms. The __init__ could take the optional x-values and maybe some flags to skip sorting, skip checking whether the input is finite, the desired output dtype, etc. Then each fit method takes the y-values and any parameters.

Benefits include:

  • Unifies the API. No longer need to remember which functions take both x and y or just y (or just x for misc.interp_pts...).
  • A lot of the setup could be faster; for example, the Vandermonde matrix and spline basis only need to be made once as long as the polynomial/spline degree does not change.
  • Much easier to make universal changes, such as controlling the output baseline dtype, masking to allow for fixed weights, masking NaN, etc.
  • The current functional API can be easily maintained for backward compatibility (see below).

Description of a possible solution or alternative

The syntax would look something like:

from pybaselines.polynomial import Polynomial
from pybaselines.whittaker import Whittaker
# or maybe: from pybaselines.api import Polynomial, Whittaker

# single fit; functionally equivalent to current implementation;
# x-values are not used by asls but can still be input if so desired
output = Whittaker(x).asls(y, lam=1e6)
output2 = Polynomial(x).modpoly(y, poly_order=2)

# multiple fits with different x-values or different sized 
# inputs; functionally equivalent to current implementation
poly_outputs = []
whittaker_outputs = []
for x_vals, y_vals in zip(x_values, y_values):
    poly_outputs.append(Polynomial(x_vals).modpoly(y_vals, poly_order=2))
    whittaker_outputs.append(Whittaker().asls(y_vals, lam=1e6))

# multiple fits with same x-values or data that is all
# the same size; potentially faster than current implementation
poly_fitter = Polynomial(x)
whittaker_fitter = Whittaker()
poly_outputs = []
whittaker_outputs = []
for y_vals in y_values:
    poly_outputs.append(poly_fitter.modpoly(y_vals, poly_order=2))
    whittaker_outputs.append(whittaker_fitter.asls(y_vals, lam=1e6))

The initial setup is done during the first method call, and any necessary changes are made on subsequent calls. If x-values are not given, then they are created during the first method call with the same size as y. If the object is called with y-values that have a different size from the initial call, it is probably easiest and more consistent to just raise an Exception rather than trying to allow changing the size only if no x-values were input.

The classes could all be accessible from one module, such as pybaselines.api or pybaselines.core, so that they're much easier to use than having to import each from a separate module.

The functional API can be maintained for backward compatibility by doing:

def asls(data, ...):
    return Whittaker().asls(data, ...)

def modpoly(data, x_data, ...):
    return Polynomial(x_data).modpoly(data, ...)

Could also add x_data keyword arguments to all functions that don't currently take them (plus data to misc.interp_pts) to unify the current functional API as well.
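For example, a hypothetical sketch of that unification (parameter defaults are only illustrative):

def asls(data, lam=1e6, p=1e-2, x_data=None):
    # x_data is accepted only for API symmetry; asls does not actually use it
    return Whittaker(x_data).asls(data, lam=lam, p=p)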
