pvlib / pvanalytics

Quality control, filtering, feature labeling, and other tools for working with data from photovoltaic energy systems.

Home Page: https://pvanalytics.readthedocs.io

License: MIT License

Languages: Python 100.0%

Topics: photovoltaic, python, renewable-energy, renewables, solar-energy

pvanalytics's Introduction


PVAnalytics

PVAnalytics is a Python library that supports analytics for PV systems. It provides functions for quality control, filtering, and feature labeling, along with other tools supporting the analysis of PV system-level data.

PVAnalytics is available at PyPI and can be installed using pip:

pip install pvanalytics

Documentation and example usage are available at pvanalytics.readthedocs.io.

Library Overview

The functions provided by PVAnalytics are organized in modules based on their anticipated use. The structure/organization below is likely to change as use cases are identified and refined and as package content evolves. The functions in quality and features take a series of data and return a series of booleans; a short usage sketch follows the module list below. For more detailed descriptions, see our API Reference.

  • quality contains submodules for different kinds of data quality checks.

    • data_shifts contains quality checks for detecting and isolating data shifts in PV time series data.
    • irradiance provides quality checks for irradiance measurements.
    • weather has quality checks for weather data (for example tests for physically plausible values of temperature, wind speed, humidity, etc.)
    • outliers contains different functions for identifying outliers in the data.
    • gaps contains functions for identifying gaps in the data (i.e. missing values, stuck values, and interpolation).
    • time quality checks related to time (e.g. timestamp spacing)
    • util general purpose quality functions.
  • features contains submodules with different methods for identifying and labeling salient features.

    • clipping functions for labeling inverter clipping.
    • clearsky functions for identifying periods of clear sky conditions.
    • daytime functions for identifying periods of day and night.
    • orientation functions for labeling data as corresponding to a rotating solar tracker or a fixed tilt structure.
    • shading functions for identifying shadows.
  • system identification of PV system characteristics from data (e.g. nameplate power, tilt, azimuth)

  • metrics contains functions for computing PV system-level metrics
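
As a rough illustration of the boolean-mask convention described above (a minimal sketch only; the input file name is a placeholder and exact signatures may differ between releases, see the API Reference):

import pandas as pd
from pvanalytics.features import clipping
from pvanalytics.quality import outliers

# AC power measurements with a DatetimeIndex ("ac_power.csv" is a placeholder)
ac_power = pd.read_csv("ac_power.csv", index_col=0, parse_dates=True).squeeze("columns")

is_outlier = outliers.zscore(ac_power)   # boolean Series: z-score outliers
is_clipped = clipping.levels(ac_power)   # boolean Series: inverter clipping

clean = ac_power[~is_outlier & ~is_clipped]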

pvanalytics's People

Contributors

adamrjensen, cdeline, cwhanse, kandersolar, kperrynrel, michaelhopwood, qnguyen345, spaneja, wfvining


pvanalytics's Issues

Consider archiving pvanalytics releases on zenodo

Zenodo is a free archival service that many OSS projects use to archive copies of the repository when a new release is cut. See for example the latest releases for pvlib and rdtools. One advantage Zenodo has over the github releases page is that a DOI is automatically minted for each archive, making it somewhat easier for people to cite specific versions of the package.

There is a way to configure Zenodo to automatically create a new archive whenever a new release is created on GitHub. I've not done it myself so I'm not sure about the details, but I think it's via a webhook in the repo settings. It's probably worth setting up that integration for pvanalytics.

Edit: https://docs.github.com/en/repositories/archiving-a-github-repository/referencing-and-citing-content

Clipping levels function

I was trying the clipping detection functions, and clipping.levels shows a behavior that I was not expecting.
The default number of levels is 2, and in my case (see figure below, showing a small portion of a full year of data) it identifies night periods as clipping. If I force levels=1, clipping is identified only during the night, presumably because that is the interval with the most data points.

Am I using the function incorrectly? Or should there be, for example, a preceding step where daytime periods are identified?
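
One possible workaround while the question is open: mask out the night-time data before running the clipping detection. This is only a sketch; the 1% threshold is an arbitrary placeholder, and a proper day/night classifier (e.g. from features.daytime) would be preferable.

from pvanalytics.features import clipping

# ac_power: Series of AC power with a DatetimeIndex
# Crude daytime mask: values above 1% of the observed maximum (placeholder only).
daytime = ac_power > 0.01 * ac_power.max()

# Detect clipping on daytime data only, then expand back to the full index.
clipped = clipping.levels(ac_power[daytime]).reindex(ac_power.index, fill_value=False)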

Detect sunny days

The PVFleets QA project does this by fitting curves (quadratic for fixed systems, quartic for tracking systems) on each day.
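
A minimal sketch of that idea for a fixed system, judging each day by how well a quadratic fits the daytime data (the minimum-sample count and R² threshold are arbitrary placeholders):

import numpy as np
import pandas as pd

def sunny_days(power, r2_threshold=0.9):
    """Flag days whose daytime power profile is well fit by a quadratic."""
    daytime = power[power > 0]
    flags = {}
    for date, day in daytime.groupby(daytime.index.date):
        if len(day) < 10:
            flags[date] = False
            continue
        x = (day.index - day.index[0]).total_seconds().values
        residuals = day.values - np.polyval(np.polyfit(x, day.values, 2), x)
        ss_tot = np.sum((day.values - day.values.mean()) ** 2)
        r2 = 1 - np.sum(residuals ** 2) / ss_tot if ss_tot > 0 else 0.0
        flags[date] = r2 >= r2_threshold
    return pd.Series(flags)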

Tests for clearsky.reno fail with pvlib 0.8.1

pvlib 0.8.1 changed the labeling of clear-sky periods (pvlib/pvlib-python#1074), which results in test failures in test_clearsky.py.

The change in pvlib labels windows on the center by default, while previous behavior labeled on the left. We probably want to accept the new default behavior from PVLib (updating the tests to match) and expose the new align parameter in pvanalytics.clearsky.reno.

Automatically detect "filler" values

Many data sets contain filler values representing times when the sensor or data logger was not working (e.g. 999, -999). When we know what the filler values are, it is fairly straightforward to detect/remove them; however, we may not always know what values are used as "filler." We should add some methods for automatically identifying filler values in quality.gaps.

One method for automatically identifying filler values could be to look for identical outliers that occur repeatedly throughout the data. There are almost certainly other ways to do this; this issue can serve as a solicitation for suggestions.
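
A rough sketch of the "repeated identical outliers" idea (the repetition count and robust z-score cutoff are arbitrary placeholders):

import numpy as np

def candidate_filler_values(data, min_count=10, zmax=4):
    """Return values in a pandas Series that repeat at least min_count times
    and sit far from the bulk of the data (robust z-score above zmax)."""
    counts = data.value_counts()
    repeated = counts[counts >= min_count].index
    median = data.median()
    mad = (data - median).abs().median()
    if mad == 0:
        return []
    robust_z = np.abs((repeated - median) / (1.4826 * mad))
    return list(repeated[robust_z > zmax])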

Adopt dependency management policy

NEP-29 is probably the one to use (see #81). We will support minor python versions going back 42 months (currently python >= 3.6) and numpy/scipy released in the previous 24 months (currently scipy >= 1.2 and numpy >= 1.15).

It might also be a good idea to apply this policy to the pandas version as well, meaning we would support pandas >= 0.23 (and still < 1.1, see #82).

Determine azimuth and tilt for fixed PV systems

Given a series of power or irradiance measurements, determine the azimuth and tilt of the sensor/system.

The PVFleets QA code does this by fitting a parabola to each day's data and comparing the solar azimuth at the parabola's maximum with the solar azimuth at the daily maximum of POA irradiance modeled for different tilt and azimuth candidates. The tilt-azimuth pair that gives the smallest difference is returned as the tilt and azimuth of the system.
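
A compressed sketch of a grid-search variant of this idea using pvlib; it compares the clock time of the daily maximum rather than the solar azimuth at the maximum, and it glosses over data cleaning, timezone handling, and NaN days. The candidate grid and clear-sky model are placeholders, not the PVFleets implementation.

import numpy as np
import pvlib

def estimate_tilt_azimuth(poa_measured, latitude, longitude,
                          tilts=range(0, 65, 5), azimuths=range(90, 275, 5)):
    """Pick the (tilt, azimuth) whose modeled clear-sky POA irradiance peaks
    closest in time, day by day, to the measured data."""
    location = pvlib.location.Location(latitude, longitude)
    solpos = location.get_solarposition(poa_measured.index)
    clearsky = location.get_clearsky(poa_measured.index)
    measured_peak = poa_measured.groupby(poa_measured.index.date).idxmax()

    best, best_err = None, np.inf
    for tilt in tilts:
        for azimuth in azimuths:
            poa_model = pvlib.irradiance.get_total_irradiance(
                tilt, azimuth,
                solpos['apparent_zenith'], solpos['azimuth'],
                clearsky['dni'], clearsky['ghi'], clearsky['dhi'],
            )['poa_global']
            model_peak = poa_model.groupby(poa_model.index.date).idxmax()
            # mean absolute difference between daily peak times, in seconds
            err = (measured_peak - model_peak).abs().mean().total_seconds()
            if err < best_err:
                best, best_err = (tilt, azimuth), err
    return best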

Set up packaging

We need an installation guide (#2). Before we write that, we should set up packaging for the library with setuptools.

Set up continuous integration/delivery

  • build and test with Travis, Azure Pipelines, or GitHub Actions
  • code coverage with coveralls and/or codecov
  • set up lgtm or other linter
  • package and deploy to pypi

Set up readthedocs

We already have an empty skeleton for a sphinx documentation tree; we still need to set up a readthedocs page/build to host the documentation.

Handling missing data in outliers

Hello, first of all very interesting project!

I was trying out the outlier functions. The outliers.zscore function (which depends on scipy.stats.zscore) returns a series full of NaNs when at least one NaN is present in the data.

Would it be helpful to add an option to ignore such data, e.g. stats.zscore(data, nan_policy='omit'), or to drop NaN values first? Or is handling this type of data out of scope?
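
A sketch of the two options mentioned above, from the caller's perspective (the nan_policy route requires a scipy version that supports it):

import numpy as np
import pandas as pd
from scipy import stats

data = pd.Series([1.0, 1.1, 0.9, np.nan, 5.0, 1.0])

# Option A: let scipy skip NaNs (if supported by the installed scipy)
z_omit = pd.Series(stats.zscore(data, nan_policy='omit'), index=data.index)

# Option B: drop NaNs first and re-align afterwards
valid = data.dropna()
z_drop = pd.Series(stats.zscore(valid), index=valid.index).reindex(data.index)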

Support pandas >= 1.1.0

Pandas seems to have ironed out the .groupby().rolling() bugs that were introduced in version 1.1.0. Unfortunately, while the two obvious bugs that were initially holding up this change have been fixed, the behavior seems to have changed. In particular these two tests fail at pandas 1.1.3:

  • tests/features/test_daytime.py::test_daytime_split_day
  • tests/features/test_daytime.py::test_daytime_daylight_savings

The output for both classifies some data that is obviously daytime as night. Might still be a pandas bug, but I'm not sure at this point.

(single-axis) trackers

Three things we want PVAnalytics to tell us about PV systems with single-axis trackers (to start):

  1. Does the system have a tracker (based on power or POA irradiance)?
  2. Does it have a fixed orientation?
  3. If it does have a tracker, on which days is the tracker functioning correctly (and on which days is it stuck)?

The first two can be addressed by #49 which uses an envelope fitting strategy. The third point can be addressed by #52. Other techniques (e.g. https://doi.org/10.1002/pip.3002) can also be applied to the third point given more information about the system.

I'm opening this to try to keep track of the bigger picture. Add any additional desiderata or methodology for the above items here.

Function to calculate the Variability Index

I have a function somewhere that calculates the variability index (VI). I think this package would be a good home for it and I'd be happy to contribute it if there is a place for it.

The features module might be an appropriate place, but VI is a daily float value rather than a boolean.
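
For reference, a sketch assuming the Stein et al. (2012) definition of VI (the ratio of the "length" of the measured GHI curve to that of the clear-sky GHI curve over each day); the exact definition to adopt would be up to the contributor:

import numpy as np

def variability_index(ghi, clearsky_ghi, freq_minutes=1):
    """Daily Variability Index, assuming both Series share the same regular
    DatetimeIndex with freq_minutes spacing."""
    def line_length(series):
        return np.sqrt(series.diff() ** 2 + freq_minutes ** 2).sum()

    daily_measured = ghi.groupby(ghi.index.date).apply(line_length)
    daily_clear = clearsky_ghi.groupby(clearsky_ghi.index.date).apply(line_length)
    return daily_measured / daily_clear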

Publish to pypi when release is tagged

Configure github action to build and publish packages on pypi when a release is tagged and all tests have passed.

This can be added as a new job to the "lint and test" workflow using the needs key:

jobs:

  # [...]

  publish: 
    runs-on: ubuntu-latest
    needs: [lint, test]
    if: github.event_name == "release" # ???

    steps:
      # checkout, setup-python, install setuptools and twine, build package, upload package

Flesh out README

Need to add a more detailed overview of the library to the README.

Should include at least some documentation of the dependencies and instructions for installing/building the library.

Sphinx docs don't show example usage

Currently the sphinx docs are focused primarily on API documentation. Of course that's a necessary component for good docs, but example usage would be a good addition. #124 adds one example file and I think the plan is for future PRs to do the same. I'm aware of a few common ways of showing example usage in sphinx docs (probably there are more I'm not thinking of):

  1. Docstring examples: Numpydoc style allows an Examples section with short code input and output snippets (ref). For functions with simple inputs and outputs this can be a good option, but I'm not sure it's suitable for pvanalytics functions, which typically have big timeseries inputs and outputs. Here's an example of what it looks like from numpy: https://numpy.org/doc/stable/reference/generated/numpy.sin.html
  2. Example scripts using sphinx-gallery: sphinx-gallery is a popular sphinx extension for building galleries of example scripts like pvlib's example gallery. It's better suited for examples longer than just one line. Example code is stored in .py files which get executed during the docs build.
  3. Example scripts using nbsphinx: nbsphinx can do galleries similar to those of sphinx-gallery, but using jupyter notebooks instead of .py files for the source. The galleries are not quite as nice as those of sphinx-gallery because intersphinx linking doesn't work in nbsphinx galleries. The benefit here is that it uses the output stored in the notebook file, meaning the examples do not have to be executed during the docs build. We use it for the RdTools gallery because the examples take a long time to run.

Of these three, I think option 2 makes the most sense for pvanalytics, at least for now: single-line examples seem too limiting, and notebooks are a pain to manage in git repos. If we eventually find ourselves unhappy with docs builds taking too long, or if some example can't be run on the CI for some reason, we can always switch over to the notebook-based galleries.

Any thoughts? I'll volunteer to set up sphinx-gallery if others agree it's the way to go.
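
For reference, the sphinx-gallery setup is mostly a few lines in the docs conf.py plus a directory of example scripts (a sketch; the other extensions and directory names are placeholders for whatever this repo's docs already use):

# docs/conf.py (excerpt)
extensions = [
    'sphinx.ext.autodoc',
    'sphinx.ext.napoleon',
    'sphinx_gallery.gen_gallery',
]

sphinx_gallery_conf = {
    'examples_dirs': ['../examples'],       # source .py example scripts
    'gallery_dirs': ['generated/gallery'],  # where the rendered gallery is written
}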

Migrate tests to pvlib-python v0.9.0

Remove the use of SingleAxisTracker (deprecated in pvlib-python v0.9.0).

Include a test that uses the new pvlib-python capability of a PVSystem having multiple Arrays.

Add a pull request template

Time for a PR template. This will help the community contribute and help me remember all the items on my mental checklist before requesting review or merging.

Handle missing data gracefully in clipping.levels()

features.clipping.levels() raises some very difficult-to-decipher ValueErrors from numpy when there are NAs in the data. We should either raise a more meaningful and useful exception or simply deal with NAs by dropping them, applying the clipping filter, reindexing, and filling missing values with False.
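
A sketch of the drop/apply/reindex behavior described above, done on the caller's side for now:

from pvanalytics.features import clipping

# ac_power: Series of AC power that may contain NAs
valid = ac_power.dropna()
clipped = clipping.levels(valid).reindex(ac_power.index, fill_value=False)
# Positions that were NA end up flagged as False (not clipped).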

Column Names and Translation Dictionaries

@cwhanse and @wfvining, I'm considering whether I should pull some of the functionality of pvcaptest out into a separate package, and I'd like your feedback on whether pvanalytics might be a good home for it or whether I should create a new package.

There are two closely related features that I'm considering pulling out:

  • a class and set of functions for cleaning up column names, see the column_renaming branch of my fork of pvcaptest, and
  • functions that parse column names to categorize each measurement by equipment type and/or physical value measured, creating a 'translation dictionary'

A substantial amount of pvcaptest functionality depends on having a translation dictionary (CapData.column_groups). This approach was originally inspired by the Pecos package. Pecos makes use of the translation dictionary concept, but does not generate the dictionaries.

The more performance-engineering work I do, especially on tests with longer time frames, the more I think it would be valuable to use Pecos. To facilitate this, I think it would make sense to move automatic translation-dictionary generation out of pvcaptest into a more general-purpose package (pvanalytics?) that can output translation dictionaries usable by both pvcaptest and Pecos.

The pvcaptest code that generates translation dictionaries is contained in the group_columns function and the __series_type function. This algorithm works surprisingly well given how rudimentary it is, but it could definitely be greatly improved.
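
To make the translation-dictionary concept concrete, a toy keyword-based illustration (entirely hypothetical column names and group labels; the pvcaptest algorithm is more involved):

import re

columns = ['MET1_GHI_Wm2', 'MET2_GHI_Wm2', 'INV1_AC_Power_kW',
           'INV2_AC_Power_kW', 'MET1_Amb_Temp_C']

keywords = {
    'irradiance_ghi': ['ghi'],
    'power_ac': ['ac_power', 'power'],
    'temp_ambient': ['amb_temp', 'temp'],
}

translation = {group: [] for group in keywords}
for col in columns:
    for group, words in keywords.items():
        if any(re.search(w, col, re.IGNORECASE) for w in words):
            translation[group].append(col)
            break

# translation == {'irradiance_ghi': ['MET1_GHI_Wm2', 'MET2_GHI_Wm2'],
#                 'power_ac': ['INV1_AC_Power_kW', 'INV2_AC_Power_kW'],
#                 'temp_ambient': ['MET1_Amb_Temp_C']}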

I started with the column-renaming tools because of how much variety there is in column names coming from a wide range of DAS/SCADA vendors and projects. I think this has to be a first step to get any type of reliable results from an algorithm that automatically generates the translation dictionary.

Look forward to hearing your thoughts!

scipy.stats.median_absolute_deviation is deprecated

Getting a warning from the tests for pvanalytics.quality.outliers

From the deprecation warning:

use median_abs_deviation instead!

To preserve the existing default behavior, use scipy.stats.median_abs_deviation(..., scale=1/1.4826). The value 1.4826 is not numerically precise for scaling with a normal distribution. For a numerically precise value, use scipy.stats.median_abs_deviation(..., scale='normal').
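
In code, the migration for quality.outliers would look roughly like this (following the two options from the deprecation notice):

import numpy as np
from scipy import stats

data = np.array([1.0, 1.1, 0.9, 1.05, 5.0])

# Old (deprecated): stats.median_absolute_deviation(data)

# Preserve the existing default behavior:
mad_old_default = stats.median_abs_deviation(data, scale=1/1.4826)

# Numerically precise scaling for a normal distribution:
mad_normal = stats.median_abs_deviation(data, scale='normal')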

Enable ReadTheDocs PR builds

pvlib-python and rdtools have configured RTD to automatically build docs as one of the checks on a PR. I think it's a great feature and use it all the time -- I usually build docs locally for my own PRs, but when reviewing other people's it's very handy to not have to clone the branch and build locally to see how the docs turned out. Also useful for people who don't necessarily want to learn how to build docs locally. Any reason not to do the same for pvanalytics?

Add module-temperature/irradiance correlation check

PVFleets QA checks that the module temperature is correlated with the irradiance. This is a pass/fail quality check: if the temperature and irradiance have sufficiently high correlation over the entire series then the temperature sensor passes. It might be useful to apply the correlation check on rolling windows as well and return a boolean series rather than just a pass/fail.

I'm not sure where this should go in the quality module.
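
A sketch of both the whole-series pass/fail check and a rolling variant (the correlation threshold and window length are placeholders):

def temperature_irradiance_check(temperature, irradiance,
                                 threshold=0.5, window='24h'):
    """Return a pass/fail bool for the whole series plus a rolling boolean
    Series, based on the correlation between module temperature and
    irradiance (both pandas Series with a DatetimeIndex)."""
    overall_pass = temperature.corr(irradiance) > threshold
    rolling_ok = temperature.rolling(window).corr(irradiance) > threshold
    return overall_pass, rolling_ok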

Add outlier detection for highly skewed data

@wfvining Thanks, these outlier filters are a good start and perform reasonably well for roughly symmetrically distributed data, but they are limited for highly skewed data. Unfortunately, outdoor PV data fall into that category of highly skewed distributions. I would suggest adding some more robust outlier detectors, like the Hampel filter, which is based on the median and the median absolute deviation. It can be improved further by incorporating the medcouple to adjust for skewness.

Originally posted by @dirkjordan in #21 (comment)
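
For concreteness, a minimal rolling Hampel filter sketch (window size and threshold are placeholders; the medcouple/skewness adjustment mentioned above is not included):

def hampel(data, window=13, max_deviation=3.0):
    """Flag points in a pandas Series that are more than max_deviation
    scaled MADs away from the rolling median (basic Hampel filter)."""
    rolling_median = data.rolling(window, center=True).median()
    mad = (data - rolling_median).abs().rolling(window, center=True).median()
    return (data - rolling_median).abs() > max_deviation * 1.4826 * mad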

handle faulty DST logic with repeated indices

I have a file in which the data appears not to use DST, but there's a complication: the transition to DST does occur at 2 am, but an hour later the data falls back to 3 am. So the effect is that the 3 am hour is repeated in the data.

I speculate that this happens with a non-trivial number of data loggers, so it would be nice if pvanalytics could flag and fix it. index.is_monotonic and index.duplicated() seem like good places to start, but I don't have a complete suggestion for how to implement it.

See example in the details tag below, and find the file in the zip archive below that.

df.index[96598:96598+60*1+4]

DatetimeIndex(['2020-03-08 01:59:00', '2020-03-08 03:00:00',
               '2020-03-08 03:01:00', '2020-03-08 03:02:00',
               '2020-03-08 03:03:00', '2020-03-08 03:04:00',
               '2020-03-08 03:05:00', '2020-03-08 03:06:00',
               '2020-03-08 03:07:00', '2020-03-08 03:08:00',
               '2020-03-08 03:09:00', '2020-03-08 03:10:00',
               '2020-03-08 03:11:00', '2020-03-08 03:12:00',
               '2020-03-08 03:13:00', '2020-03-08 03:14:00',
               '2020-03-08 03:15:00', '2020-03-08 03:16:00',
               '2020-03-08 03:17:00', '2020-03-08 03:18:00',
               '2020-03-08 03:19:00', '2020-03-08 03:20:00',
               '2020-03-08 03:21:00', '2020-03-08 03:22:00',
               '2020-03-08 03:23:00', '2020-03-08 03:24:00',
               '2020-03-08 03:25:00', '2020-03-08 03:26:00',
               '2020-03-08 03:27:00', '2020-03-08 03:28:00',
               '2020-03-08 03:29:00', '2020-03-08 03:30:00',
               '2020-03-08 03:31:00', '2020-03-08 03:32:00',
               '2020-03-08 03:33:00', '2020-03-08 03:34:00',
               '2020-03-08 03:35:00', '2020-03-08 03:36:00',
               '2020-03-08 03:37:00', '2020-03-08 03:38:00',
               '2020-03-08 03:39:00', '2020-03-08 03:40:00',
               '2020-03-08 03:41:00', '2020-03-08 03:42:00',
               '2020-03-08 03:43:00', '2020-03-08 03:44:00',
               '2020-03-08 03:45:00', '2020-03-08 03:46:00',
               '2020-03-08 03:47:00', '2020-03-08 03:48:00',
               '2020-03-08 03:49:00', '2020-03-08 03:50:00',
               '2020-03-08 03:51:00', '2020-03-08 03:52:00',
               '2020-03-08 03:53:00', '2020-03-08 03:54:00',
               '2020-03-08 03:55:00', '2020-03-08 03:56:00',
               '2020-03-08 03:57:00', '2020-03-08 03:58:00',
               '2020-03-08 03:59:00', '2020-03-08 03:00:00',
               '2020-03-08 03:01:00', '2020-03-08 03:02:00'],
              dtype='datetime64[ns]', name='Time', freq=None)

FSEC_RTC_Weather_2020.csv.zip
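
A starting point along the lines suggested above (a sketch only; deciding how to fix the repeated hour, e.g. shifting the first 3 am block back to 2 am, is the harder part):

def has_repeated_dst_hour(index):
    """True if a DatetimeIndex is non-monotonic and contains duplicated
    timestamps, the signature of the repeated-hour pattern shown above."""
    return (not index.is_monotonic_increasing) and index.duplicated().any()

# Locating the offending timestamps:
# duplicated_times = df.index[df.index.duplicated(keep=False)]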

determine if a time series is DST aware or not

I'd like a function that tells me whether a time series uses a fixed-offset timezone like 'Etc/GMT+7' or a DST-aware timezone like 'America/Denver'.

In the PV world, we often don't have data right at the DST transition (2 am local). So my suggestion is to find the sunrise, sunset, and/or solar noon in the ~7 days before and after a DST transition, then determine if there's a 1-hour shift in that data.
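
One data-driven sketch of that suggestion, using the clock time of the daily maximum as a proxy for solar noon (the window length, the 0.5 h cutoff, and the transition timestamp are placeholders; sunrise/sunset from pvlib would be more robust):

import pandas as pd

def shifts_at_dst(power, transition, days=7):
    """True if the clock time of the daily maximum shifts by roughly an hour
    across `transition` (a pandas Timestamp of a DST changeover)."""
    before = power[transition - pd.Timedelta(days=days):transition]
    after = power[transition:transition + pd.Timedelta(days=days)]

    def mean_peak_hour(series):
        peaks = series.groupby(series.index.date).idxmax()
        return (peaks.dt.hour + peaks.dt.minute / 60).mean()

    return abs(mean_peak_hour(before) - mean_peak_hour(after)) > 0.5

# A series recorded in DST-adjusted local time should return True around a
# DST changeover; a fixed-offset series should return False.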

Requirements in setup.py are redundant, and partially inconsistent, with requirements.txt

index.rst advertises the contents of requirements.txt as the package's dependencies:

numpy>=1.15.0
pandas>=0.24.0,<1.1.0
pvlib>=0.8.0
scipy>=1.2.0
statsmodels>=0.9.0
scikit-image>=0.16.0

But the pandas version range specified in setup.py is in fact slightly different (ref):

pvanalytics/setup.py

Lines 33 to 40 in a3d3860

INSTALL_REQUIRES = [
'numpy >= 1.15.0',
'pandas >= 0.24.0',
'pvlib >= 0.8.0',
'scipy >= 1.2.0',
'statsmodels >= 0.9.0',
'scikit-image >= 0.16.0'
]

Worth considering having a "single source of truth" for the requirement ranges so that there's no way for redundant specs to get out of sync. I think some packages solve this by including a line in setup.py like this: INSTALL_REQUIRES = open('requirements.txt').read().split('\n'). Maybe eliminating requirements.txt and keeping version ranges in setup.py would make more sense in this case.

Same goes for DOCS_REQUIRE and docs/requirements.txt.
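
A sketch of the requirements.txt-as-single-source variant mentioned above, filtering out blank lines and comments (which the one-liner does not):

# setup.py (excerpt)
with open('requirements.txt') as f:
    INSTALL_REQUIRES = [
        line.strip() for line in f
        if line.strip() and not line.startswith('#')
    ]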
