earthobservations / wetterdienst

Open weather data for humans.

Home Page: https://wetterdienst.readthedocs.io/

License: MIT License

Language: Python (100%)

Topics: deutscher-wetterdienst, germany, open-source, open-data, historical-data, time-series, dwd, radar, weather, weatherservice

wetterdienst's Introduction

Wetterdienst - Open weather data for humans

[Image: warming stripes of Hohenpeissenberg/Germany]

CI: overall outcome · documentation status · code coverage

Meta: PyPI version · Conda version · project license · project status (alpha/beta/stable) · Python version compatibility

Downloads: PyPI downloads · Conda downloads

Citation: citation reference

Introduction

Overview

Welcome to Wetterdienst, your friendly weather service library for Python.

We are a group of like-minded people trying to make access to weather data in Python feel like a warm summer breeze, similar to projects such as rdwd for the R language, which originally drew our interest in this project. Our long-term goal is to provide access to multiple weather services as well as other related agencies, for example for river measurements. Wetterdienst uses modern Python technologies throughout: the library is built on polars (we <3 pandas, which is still part of some IO processes), uses Poetry for package management and GitHub Actions for all things CI. Our users are an important part of development: since we do not currently use the data we provide ourselves, we implement what we believe works best. Contributions and feedback, whether data related or library related, are therefore very welcome! Just hand in a PR or issue if you think we should include a new feature or data source.

Data

For an overview of the data we currently make available, and the licenses it is published under, take a look at the data section. Detailed information on datasets and parameters is given in the coverage subsection. Licenses and usage requirements may differ for each provider, so check them before including the data in your project to make sure you fulfill the copyright requirements!

Features

  • APIs for stations and values
  • Get stations near a selected location
  • Define your request by arguments such as parameter, period, resolution, start date, end date
  • Define general settings in Settings context
  • Command line interface
  • Web-API via FastAPI
  • Rich UI features like wetterdienst explorer
  • Run SQL queries on the results
  • Export results to databases and other data sinks
  • Public Docker image
  • Interpolation and Summary of station values

Setup

Native

Via PyPI (standard):

pip install wetterdienst

Via Github (most recent):

pip install git+https://github.com/earthobservations/wetterdienst

There are some extras available for wetterdienst. Use them like:

pip install wetterdienst[sql]
  • docs: Install the Sphinx documentation generator.
  • ipython: Install iPython stack.
  • export: Install openpyxl for Excel export and pyarrow for writing files in Feather- and Parquet-format.
  • sql: Install DuckDB for querying data using SQL.
  • duckdb: Install support for DuckDB.
  • influxdb: Install support for InfluxDB.
  • cratedb: Install support for CrateDB.
  • mysql: Install support for MySQL.
  • postgresql: Install support for PostgreSQL.
  • interpolation: Install support for station interpolation.

In order to check the installation, invoke:

wetterdienst --help

Docker

Docker images for each stable release are pushed to the GitHub Container Registry.

The image serves a full environment, including all of the optional dependencies of Wetterdienst.

Pull the Docker image:

docker pull ghcr.io/earthobservations/wetterdienst

Library

Use the latest stable version of wetterdienst:

$ docker run -ti ghcr.io/earthobservations/wetterdienst
Python 3.8.5 (default, Sep 10 2020, 16:58:22)
[GCC 8.3.0] on linux
>>> import wetterdienst
>>> wetterdienst.__version__

Command line script

The wetterdienst command is also available:

# Make an alias to use it conveniently from your shell.
alias wetterdienst='docker run -ti ghcr.io/earthobservations/wetterdienst wetterdienst'

wetterdienst --help
wetterdienst --version
wetterdienst info

Raspberry Pi / LINUX ARM

To run wetterdienst on a Raspberry Pi, you need to install numpy and lxml before installing wetterdienst. Run the following lines:

# not all installations may be required to get lxml running
sudo apt-get install gfortran
sudo apt-get install libopenblas-base
sudo apt-get install libopenblas-dev
sudo apt-get install libatlas-base-dev
sudo apt-get install python3-lxml

Additionally, expanding the swap space to 2048 MB may be required; this can be done via the swap file:

sudo nano /etc/dphys-swapfile
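
In the swap file, raise the configured swap size (assuming the stock dphys-swapfile configuration key):

CONF_SWAPSIZE=2048

Afterwards re-initialize and enable the swap, e.g. with sudo dphys-swapfile setup followed by sudo dphys-swapfile swapon.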

Thanks chr-sto for reporting back to us!

Example

Task: Get historical climate summary for two German stations between 1990 and 2020

Library

>>> import polars as pl
>>> _ = pl.Config.set_tbl_hide_dataframe_shape(True)
>>> from wetterdienst import Settings
>>> from wetterdienst.provider.dwd.observation import DwdObservationRequest
>>> settings = Settings( # default
...     ts_shape="long",  # tidy data
...     ts_humanize=True,  # humanized parameters
...     ts_si_units=True  # convert values to SI units
... )
>>> request = DwdObservationRequest(
...    parameter="climate_summary",
...    resolution="daily",
...    start_date="1990-01-01",  # if not given timezone defaulted to UTC
...    end_date="2020-01-01",  # if not given timezone defaulted to UTC
...    settings=settings
... ).filter_by_station_id(station_id=(1048, 4411))
>>> stations = request.df
>>> stations.head()
┌────────────┬─────────────────────────┬─────────────────────────┬──────────┬───────────┬────────┬────────────────────────┬─────────┐
│ station_id ┆ start_date              ┆ end_date                ┆ latitude ┆ longitude ┆ height ┆ name                   ┆ state   │
│ ---        ┆ ---                     ┆ ---                     ┆ ---      ┆ ---       ┆ ---    ┆ ---                    ┆ ---     │
│ str        ┆ datetime[μs, UTC]       ┆ datetime[μs, UTC]       ┆ f64      ┆ f64       ┆ f64    ┆ str                    ┆ str     │
╞════════════╪═════════════════════════╪═════════════════════════╪══════════╪═══════════╪════════╪════════════════════════╪═════════╡
│ 01048      ┆ 1934-01-01 00:00:00 UTC ┆ ...                     ┆ 51.1278  ┆ 13.7543   ┆ 228.0  ┆ Dresden-Klotzsche      ┆ Sachsen │
│ 04411      ┆ 1979-12-01 00:00:00 UTC ┆ ...                     ┆ 49.9195  ┆ 8.9672    ┆ 155.0  ┆ Schaafheim-Schlierbach ┆ Hessen  │
└────────────┴─────────────────────────┴─────────────────────────┴──────────┴───────────┴────────┴────────────────────────┴─────────┘
>>> values = request.values.all().df
>>> values.head()
┌────────────┬─────────────────┬───────────────────┬─────────────────────────┬───────┬─────────┐
│ station_id ┆ dataset         ┆ parameter         ┆ date                    ┆ value ┆ quality │
│ ---        ┆ ---             ┆ ---               ┆ ---                     ┆ ---   ┆ ---     │
│ str        ┆ str             ┆ str               ┆ datetime[μs, UTC]       ┆ f64   ┆ f64     │
╞════════════╪═════════════════╪═══════════════════╪═════════════════════════╪═══════╪═════════╡
│ 01048      ┆ climate_summary ┆ cloud_cover_total ┆ 1990-01-01 00:00:00 UTC ┆ 100.0 ┆ 10.0    │
│ 01048      ┆ climate_summary ┆ cloud_cover_total ┆ 1990-01-02 00:00:00 UTC ┆ 100.0 ┆ 10.0    │
│ 01048      ┆ climate_summary ┆ cloud_cover_total ┆ 1990-01-03 00:00:00 UTC ┆ 91.25 ┆ 10.0    │
│ 01048      ┆ climate_summary ┆ cloud_cover_total ┆ 1990-01-04 00:00:00 UTC ┆ 28.75 ┆ 10.0    │
│ 01048      ┆ climate_summary ┆ cloud_cover_total ┆ 1990-01-05 00:00:00 UTC ┆ 91.25 ┆ 10.0    │
└────────────┴─────────────────┴───────────────────┴─────────────────────────┴───────┴─────────┘
>>> values.to_pandas()  # get a pandas DataFrame, e.g. to create some matplotlib plots

Client

# Get list of all stations for daily climate summary data in JSON format
wetterdienst stations --provider=dwd --network=observation --parameter=kl --resolution=daily --all

# Get daily climate summary data for specific stations
wetterdienst values --provider=dwd --network=observation --station=1048,4411 --parameter=kl --resolution=daily

Further examples (code samples) can be found in the examples folder.

Acknowledgements

We want to acknowledge all environmental agencies that provide their data openly and free of charge, first and foremost for the sake of endless research possibilities.

We want to acknowledge JetBrains and the JetBrains OSS team for providing us with licenses for PyCharm Professional, which we use for development.

We want to acknowledge all contributors for being part of the improvements that make this library better every day.

wetterdienst's People

Contributors

1maxnet1, amotl, brry, dependabot[bot], donni-h, e-dism, gutzbenj, ikamensh, jhbruhn, justus-braun, kmuehlbauer, korbenga, maxbachmann, maxnoe, meteodaniel, mypydavid, neumann-nico, niclashoyer, provinzkraut, sanmai-nl, xylar


wetterdienst's Issues

get_nearby_stations errors when no max_distance_in_km parameter given

Describe the bug

get_nearby_stations(
    latitudes=[54.785], 
    longitudes=[9.436], 
    period_type=PeriodType.RECENT, 
    parameter=Parameter.TEMPERATURE_AIR, 
    time_resolution=TimeResolution.HOURLY
)

gives traceback

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-50-5656b0016e68> in <module>
----> 1 get_nearby_stations(latitudes=[54.785], longitudes=[9.436], period_type=PeriodType.RECENT, parameter=Parameter.TEMPERATURE_AIR, time_resolution=TimeResolution.HOURLY)

/usr/local/lib/python3.8/site-packages/wetterdienst/additionals/geo_location.py in get_nearby_stations(latitudes, longitudes, parameter, time_resolution, period_type, num_stations_nearby, max_distance_in_km)
     83         num_stations_nearby = metadata.shape[0]
     84 
---> 85     distances, indices_nearest_neighbours = _derive_nearest_neighbours(
     86         metadata.LAT.values, metadata.LON.values, coords, num_stations_nearby
     87     )

/usr/local/lib/python3.8/site-packages/wetterdienst/additionals/geo_location.py in _derive_nearest_neighbours(latitudes_stations, longitudes_stations, coordinates, num_stations_nearby)
    146     points = np.c_[np.radians(latitudes_stations), np.radians(longitudes_stations)]
    147     distance_tree = cKDTree(points)
--> 148     return distance_tree.query(
    149         coordinates.get_coordinates_in_radians(), k=num_stations_nearby
    150     )

ckdtree.pyx in scipy.spatial.ckdtree.cKDTree.query()

TypeError: object of type 'NoneType' has no len()

Expected behavior
The same result as when a very high max_distance_in_km is passed in:

get_nearby_stations(
    latitudes=[54.785], 
    longitudes=[9.436], 
    period_type=PeriodType.RECENT, 
    parameter=Parameter.TEMPERATURE_AIR, 
    time_resolution=TimeResolution.HOURLY,
    max_distance_in_km=1e6
)

output

([1666,
  1130,
  4466,
  2437,
  4896,
  4629,
  ...
...

Alternatively, it should not be possible to call the method without this parameter (which wouldn't be my preferred solution).

Desktop (please complete the following information):

  • OS: MacOS
  • Python-Version 3.8
  • wetterdienst 0.5.0

Change parameter mapping to individual mapping for each parameter

Related to #67.

As can be seen there, the solar parameter, for example, has no defined period_type. It therefore makes sense to change the mapping to an individual one for each parameter. That way, the period type can be defined manually, say to recent, if the solar data approximately covers that time range.

Using multiprocessing within high resolution data and boost reading csv

I recently tried getting data for precipitation, 1_minute, historical. Since I had changed my venv to Python 3.8, I ran into an error with multiprocessing that I could not really resolve. Meanwhile I was already thinking about switching to multiprocessing.dummy, a thread-based (single-core) drop-in replacement for the multicore version. This change also prevents the script from hogging the cores (which can happen with up to 100% processor usage).

On the other hand, regardless of which method (multiprocessing or multiprocessing.dummy) is used to speed things up, parsing the chosen data is really slow (I tried a station with comparatively little data, which still took two minutes). With this in mind, I'd like to try to speed up reading data from CSV, as that seems to be the bottleneck here.

setup.py imports dependencies

Hi all!
Trying to start a new virtual env for python_dwd fails, because setup.py has the line
from python_dwd import __version__
which triggers other imports from python_dwd/__init__.py and leads Python to try to import modules declared as dependencies in setup.py. This way, pip has no chance to install the dependencies before they are imported.

Best,
Philip

Overhaul metaindex functionality

Similar to the file index, the metaindex has to be overhauled and should be stored in the cache. Also, some functionality has to be simplified, e.g. the recursive call of metadata_for_dwd for 1-minute precipitation data.

Provide adapters for different ways to store data

Is your feature request related to a problem? Please describe.
no

Describe the solution you'd like
Instead of using single parameters (prefer_local, write_file, folder, ...), we should provide one argument, store, that is defined with a class. This class can be either HDFStore, SQLiteStore or CSVStore. All classes implement the same functions; a sketch of the common interface follows below.

Then instead of using prefer_local etc one would use

DWDStationRequest(
    ...,
    store=HDFStore(prefer_local=True, write_file=True, folder="./")
)

and furthermore, within the methods:

# check if to restore file from local drive
if request.store and request.store.prefer_local:  # request.store would default to None
    try:
        request.store.restore_data(request.parameter,...)
    except...
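
A minimal sketch of what the common interface might look like; the class and method names are assumptions derived from the proposal above, not an existing API:

import pandas as pd
from abc import ABC, abstractmethod

class Store(ABC):
    """Hypothetical common base for HDFStore, SQLiteStore and CSVStore."""

    def __init__(self, prefer_local: bool = False, write_file: bool = False, folder: str = "./"):
        self.prefer_local = prefer_local
        self.write_file = write_file
        self.folder = folder

    @abstractmethod
    def store_data(self, data: pd.DataFrame, parameter) -> None:
        """Persist a downloaded dataset locally."""

    @abstractmethod
    def restore_data(self, parameter) -> pd.DataFrame:
        """Load a previously stored dataset, raising if unavailable."""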

Describe alternatives you've considered
no alternatives considered


Improve code formatting

Hi there,

following up on #85, we might want to think about adding code formatting and linting into the pipeline, like what @JessicaS11 just did for icepyx through icesat2py/icepyx#96:

  • The Black code formatter is invoked using a pre-commit action integrated through the pre-commit framework.
  • The Flake8 linter is invoked as a CI task through GitHub Actions.

The outcome of this should not bother us any more than really needed. At least Black seems to adhere to this promise.

With kind regards,
Andreas.

Creation of file index with ftplib.nlst does not include all files

I just discovered that the nlst function does not work as expected: it doesn't seem to reflect the whole file structure as found online.

Example: nlst does not reveal the files for station id 3, year 1994, under 1_minute/precipitation/historical:

1_minute/precipitation/historical/1994/1minutenwerte_nieder_00003_19940101_19940131_hist.zip
1_minute/precipitation/historical/1994/1minutenwerte_nieder_00003_19940201_19940228_hist.zip
1_minute/precipitation/historical/1994/1minutenwerte_nieder_00003_19940301_19940331_hist.zip
1_minute/precipitation/historical/1994/1minutenwerte_nieder_00003_19940401_19940430_hist.zip
1_minute/precipitation/historical/1994/1minutenwerte_nieder_00003_19940501_19940531_hist.zip
1_minute/precipitation/historical/1994/1minutenwerte_nieder_00003_19940601_19940630_hist.zip
1_minute/precipitation/historical/1994/1minutenwerte_nieder_00003_19940701_19940731_hist.zip
1_minute/precipitation/historical/1994/1minutenwerte_nieder_00003_19940801_19940831_hist.zip
1_minute/precipitation/historical/1994/1minutenwerte_nieder_00003_19940901_19940930_hist.zip
1_minute/precipitation/historical/1994/1minutenwerte_nieder_00003_19941001_19941031_hist.zip
1_minute/precipitation/historical/1994/1minutenwerte_nieder_00003_19941101_19941130_hist.zip
1_minute/precipitation/historical/1994/1minutenwerte_nieder_00003_19941201_19941231_hist.zip

User friendly parameter definition

I think it is not very comfortable to source all enumerations every time. I think we can build a wrapper with a bit of fuzzy logic to determine the Parameter, PeriodType and TimeResolution. This is quite similar to issue #8.

define_request(parameter='precipitation', time_resolution='10min', start_date=None, end_date=None)

Enumerations are helpful to keep the code clean, but they are not very user friendly, I think. A sketch of such a wrapper follows below.
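
A minimal sketch of such a wrapper, assuming the project's existing enumerations; the string-matching logic here is an illustration, not the library's actual API:

def _resolve(value, enumeration):
    """Map a user-provided string onto an enumeration member, case-insensitively."""
    value = str(value).strip().lower()
    for member in enumeration:
        if value in (member.name.lower(), str(member.value).lower()):
            return member
    raise ValueError(f"'{value}' is not a valid {enumeration.__name__}")

def define_request(parameter, time_resolution, start_date=None, end_date=None):
    """Hypothetical wrapper that accepts plain strings instead of enum members."""
    parameter = _resolve(parameter, Parameter)
    time_resolution = _resolve(time_resolution, TimeResolution)
    ...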

Code under except runs into TypeError in download_data

     77         old_files = Path.glob(Path(folder, SUB_FOLDER_STATIONDATA))
     78 
---> 79         # f"{folder}/{SUB_FOLDER_STATIONDATA}/"
     80 
     81         # For every file in the folder list...

TypeError: glob() missing 1 required positional argument: 'pattern'

Add enumerations

Enumerations help us to predefine the possible parameters for variables. They make the code more readable, and nobody can choose a wrong parameter anymore. An example follows below.
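
For illustration, a minimal sketch of such an enumeration; member names and values are assumptions modeled on the DWD folder names:

from enum import Enum

class TimeResolution(Enum):
    MINUTE_1 = "1_minute"
    MINUTE_10 = "10_minutes"
    HOURLY = "hourly"
    DAILY = "daily"
    MONTHLY = "monthly"
    ANNUAL = "annual"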

Make download multiprocessing ready

multiprocessing's Pool.map is a useful tool to e.g. run downloads in parallel and speed up the process. We just have to build a download function that works with a list of station ids or a list of URLs, as sketched below.
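
A minimal sketch, assuming plain HTTP downloads are enough; multiprocessing.dummy.Pool offers the same map interface using threads, which suits IO-bound downloads:

from multiprocessing.dummy import Pool  # thread-based twin of multiprocessing.Pool
from urllib.request import urlopen

def download_file(url):
    """Fetch a single file and return its raw bytes."""
    with urlopen(url) as response:
        return response.read()

def download_files(urls, workers=4):
    """Download several files in parallel using a thread pool."""
    with Pool(workers) as pool:
        return pool.map(download_file, urls)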

Hello from Wetterdienst

Dear @gutzbenj and @meteoDaniel,

first things first: We would like to salute you for conceiving the best-in-class DWD CDC data access library for Python. You are doing an excellent job here.

Now, let me give you a short background about us: We are a small group of like-minded people [1] interested in meteorological data. Based on the tremendous work coming from @wetterfrosch, we operate a data acquisition server for the community with beautiful visualizations in Grafana [2] which might also spark your interest ¹.

While the data processing machinery there is still based on some quickly ramped up bash programs, we took the chance to take over development of dwd-weather [3] by @marians and contributors on our journey and continued the development within dwdweather2 [4,5].

As dwdweather2 is still based on ancient Python technologies, we have been very much looking forward to throwing out the underpinnings and replacing them with a solid framework based on Pandas. Thankfully, @jlewis91 already started working on that within his dwdbulk [6] implementation the other day.

Now, after discovering your library, this leaves absolutely nothing to wish for. So, we would like to abandon the dwdweather2 library in the long run and look forward to base our work upon the comprehensive work you are doing here.

Saying this, we discovered some spots in the code base we would like to put some humble effort in and hope that you might like our contributions. Especially, #21 and #49 resonated very much with us ².

Cheers and keep up the spirit,
Andreas.

[1] https://community.panodata.org/
[2] https://weather.hiveeyes.org/
[3] https://github.com/marians/dwd-weather
[4] https://github.com/panodata/dwdweather2
[5] https://pypi.org/project/dwdweather2
[6] https://github.com/jlewis91/dwdbulk

¹ Just press "ESC" and discover more dashboards by exploring the respective list in the upper left corner.

² ... so, based on that, we already started hacking on a cli.py module in order to bring the functionality of python_dwd to the command line. While doing that, we discovered some minor quirks at some ends but we will create different issues/pull requests for them in order not to hijack/clutter this round of introduction and the conversation which might arise from here.

download of hourly historical solar data is invalid

Describe the bug
using the function "collect_climate_observations_data()" returns an InvalidParameterCombination Exception if using the attribute combination

  • parameter = Parameter.SOLAR
  • time_resolution= TimeResolution.HOURLY
  • period_type= PeriodType.HISTORICAL

To Reproduce
Call the function for a weather station with known solar radiation measurements, using the attributes mentioned above.

Expected behavior
As can be seen at https://opendata.dwd.de/climate_environment/CDC/observations_germany/climate/hourly/solar/ (the original data source), historical hourly solar data is available.

Desktop (please complete the following information):

  • OS: Windows
  • Python-Version 3.6 and 3.7

Precise try except statements

There are several try/except statements that capture several exceptions. Please specify the one or two exceptions you actually want to capture and name them explicitly in the except statement, as in the example below.
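
For example (parse_file is a placeholder for any of the parsing helpers):

import logging

log = logging.getLogger(__name__)

# Too broad: swallows unrelated bugs along with the expected failures.
try:
    data = parse_file(path)
except Exception:
    data = None

# Precise: only the failures we expect and can actually handle.
try:
    data = parse_file(path)
except (FileNotFoundError, UnicodeDecodeError) as error:
    log.warning("could not parse %s: %s", path, error)
    data = None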

Acquire and aggregate/merge data for multiple sets of parameters

Hi there,

there is one last thing left I would like to leave a note about here regarding bringing all features of dwdweather2 to this code base. It has no priority and rather should serve as a friendly reminder to my future self.

dwdweather has the functionality to acquire data across multiple parameters aka. categories like

dwdweather weather 2667 2020-06-01T15:00 --categories air_temperature precipitation pressure

The corresponding wannabe command for wetterdienst would be

wetterdienst readings --resolution=hourly --period=recent --parameter=air_temperature,precipitation,pressure --date=2020-06-30 --station=1048

I recognize this might be some amount of work, so I will be happy to revisit this issue at some time in the future in order to work towards that goal.

The background on this is that we would like to add features for data export into InfluxDB, PostGIS and other databases to wetterdienst.cli, where information from multiple parameters should be stored within a single database collection/table. cc @wetterfrosch

With kind regards,
Andreas.


P.S.: The output of the dwdweather invocation outlined above was like:

{
    "station_id": 2667,
    "datetime": 2020060115,
    "air_temperature_200": 27.0,
    "air_temperature_quality_level": 1,
    "relative_humidity_200": 26.0,
    "precipitation_fallen": 0,
    "precipitation_form": null,
    "precipitation_height": 0.0,
    "precipitation_quality_level": 1,
    "pressure_msl": 1019.0,
    "pressure_quality_level": 1,
    "pressure_station": 1007.6,
}

In order to reach that level of awesomeness, #54 will need some more love to cover all humanized field names for all parameter sets.

Obtaining metadata about specific data sets

Hi there,

together with @wetterfrosch, we are also looking forward to bringing metadata about different datasets into this program. The best way to do that would be to directly ingest resources available on the DWD CDC server, or published elsewhere by DWD. Hereby, I wanted to start collecting some resources.

Weather observations

1. Field names and descriptions (German) as XLSX.

2. Field names and descriptions per data set (English and German) as PDF.

Field names per data set are available within the respective folder within the "Parameters" section of corresponding PDF files, for example see data set description for recent daily station observations. Within dwd_description_pdf.py, I've tried to get some Python together to parse these PDF files and it seems to work reasonably, e.g. for 10_minutes/air_temperature/recent:

+-------------+--------------------------------------------------------------------------+
| STATIONS_ID | station identification number                                            |
| MESS_DATUM  | measurement time yyyymmddhhmi                                            |
| QN          | quality level of next columns coding see paragraph "Quality information" |
| PP_10       | pressure at station height hPa                                           |
| TT_10       | air temperature at 2m height                                             |
| TM5_10      | air temperature at 5cm height                                            |
| RF_10       | relative humidity at 2m height                                           |
| TD_10       | dew point temperature at 2m height                                       |
+-------------+--------------------------------------------------------------------------+

MOSMIX forecasts

  • Within #70 (comment), I've collected some resources which describe fields available for MOSMIX.

With kind regards,
Andreas.

Monthly and annual data need MESS_DATUM_BEGINN and MESS_DATUM_ENDE columns

Hi there,

when trying to ingest data from the annual and monthly resolutions, we found that the FROM_DATE and TO_DATE fields are actually called MESS_DATUM_BEGINN and MESS_DATUM_ENDE, not VON_DATUM and BIS_DATUM as currently implemented for other resolutions.

https://github.com/gutzbenj/python_dwd/blob/d13a904adf647f8a49c53d9c9fe3547ea4568ee3/python_dwd/enumerations/column_names_enumeration.py#L17-L18

So, we might think about evolving that data structure from an enumeration into a mapping, as already implemented for other data structures; a sketch follows below.

https://github.com/gutzbenj/python_dwd/blob/d13a904adf647f8a49c53d9c9fe3547ea4568ee3/python_dwd/constants/parameter_mapping.py#L53-L82
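
A sketch of such a mapping, using the column names reported above; the dictionary structure and the referenced enumeration members are assumptions:

date_range_columns = {
    TimeResolution.MONTHLY: ("MESS_DATUM_BEGINN", "MESS_DATUM_ENDE"),
    TimeResolution.ANNUAL: ("MESS_DATUM_BEGINN", "MESS_DATUM_ENDE"),
}
# All other resolutions keep the current names.
default_date_range_columns = ("VON_DATUM", "BIS_DATUM")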

With kind regards,
Andreas.

Filtering by distance/radius

Hi there,

Introduction

This is just a suggestion for a possible future improvement of the geospatial support of this library.

Status quo

With #99, filtering by geoposition/count will be possible from the command line by building upon get_nearest_station().
This enables filtering the list of all stations by a number of stations nearby a specific geoposition.

Outlook

It would be cool to also unlock filtering by geoposition/distance, like @stianchris added through panodata/dwdweather2#16 the other day. This means the library can obtain and process a real value in meters used as a radius.

Cheers,
Andreas.

Make Sphinx autodoc work

We found at #153 (comment) that sphinx.ext.autodoc does not work on RTD yet, see Module library > API.

From the past, I know this is mostly a matter of getting the package itself installed properly into the Python environment where Sphinx will be building the documentation. When doing that, there's usually no way around also installing all its dependencies, which occasionally causes problems within (restricted) CI environments.

Documentation has room for improvements

Is your feature request related to a problem? Please describe.
The documentation is incomplete and outdated. For example, this page: https://wetterdienst.readthedocs.io/en/latest/pages/api.html#api-for-historical-weather-data references wetterdienst.collect_dwd_data, which no longer exists.

A lack of documentation makes it harder for new users to adopt this library and may cause some people to turn away from it. Wrong documentation is even worse than no documentation.

Describe the solution you'd like

  • Add a quickstart page that basically just shows what is currently on the API page
  • Make use of doctest to ensure whatever is shown there is still possible to do
  • Make use of autodoc to create an API page automatically from the docstrings without any additional effort.
    • To do that, don't document the __init__ method of a class but instead the class itself - this seems pretty standard. For reference, pandas handles it the same way
  • Make use of sphinx-autodoc-typehints to automatically generate the typing information in the documentation from the type hints

Additional context

I was curious to test the various bits and pieces. If you're interested, I could open a PR where they're cobbled together on the weekend.

Reorganize creation of fileindex

I discovered a big downside of the current way the fileindex is created. The fileindex is currently created within create_file_list_for_dwd_server:

if create_new_filelist or not Path(filelist_local_path).is_file():
    create_fileindex(parameter=parameter, time_resolution=time_resolution, period_type=period_type, folder=folder)

The parameter create_new_filelist is only set once per request but, as can be seen in the snippet above, will always create a new fileindex within create_file_list_for_dwd_server if set to True. The process of creating the fileindex should rather be decoupled from the file list creation, and create_file_list_for_dwd_server should instead raise an error if no filelist has been created, as sketched below.
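
Sketched, the decoupled check could look like this (the path construction is a placeholder):

from pathlib import Path

def create_file_list_for_dwd_server(parameter, time_resolution, period_type, folder):
    # Placeholder for however the file index location is actually derived.
    filelist_local_path = Path(folder) / "filelist.csv"
    if not filelist_local_path.is_file():
        raise FileNotFoundError(
            "No file index found; call create_fileindex() before requesting a file list."
        )
    ...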

Data type anomaly for specific fields within daily data

As outlined within panodata/dwdweather2#27, acquiring precipitation information from daily observations should yield data like

"precipitation_form": 4,
"precipitation_height": 8.8,

However, I just found out that

wetterdienst readings --resolution=daily --parameter=kl --period=recent --station=44

will yield data like

"precipitation_form":4.0,
"precipitation_height":1.5,

So, we should adjust the data type for precipitation_form to be an Integer, like designated within the dwdweather2 knowledge base module, line 142.

cc @BenjaminMews

Refactor zipfile handling

The zipfile handling (saving, opening, deleting) should be replaced by in-memory IO objects (e.g. BytesIO), so that nothing except final files like metadata is written to the hard drive. This also helps us because no file has to be deleted after extraction. A sketch follows below.
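
A minimal sketch of fully in-memory zip handling:

from io import BytesIO
from zipfile import ZipFile

def read_zip_member(zip_bytes, member_name):
    """Open a zip archive from raw bytes and return one member, all in memory."""
    with ZipFile(BytesIO(zip_bytes)) as archive:
        return BytesIO(archive.read(member_name))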

Add support for RADOLAN data

Introduction

@gutzbenj and I had a short conversation about this today and the topic also has been brought up/discussed within jeremiahpslewis/dwdbulk#5 and panodata/dwdweather2#26.

Thoughts

While it will totally make sense to build upon gems like wradlib and/or radproc, there might be room to streamline the low-level raw-data-acquisition aspects into python_dwd's machinery.

I would also like to point out respective suggestions from @mmaelicke at panodata/dwdweather2#26 (comment) here - thanks again! They touch both functional and operational aspects. Their gist is:

  • It would be cool to fit acquisition of RADOLAN data into some kind of caching logic to save network roundtrips on subsequent invocations.
  • A possible use-case would be to acquire data based on a list of stations in order to first download meteorological data and then the RADOLAN data covering the stations downloaded.
  • While respective functionality/glue-code to combine different libraries can be easily and quickly put together through custom programs, it is exactly these scripts that screw up the toolchains more often than not. So, when working in a remote cluster, maintained packages and programs are always the better choice.

Replace ftp service by https service for better reliability

The idea is to replace the old-school FTP interaction with a new, HTTPS-based service provided at https://opendata.dwd.de/. This requires a function to create a listing of available files and options to download the individual station data. Also, I believe a global requests.Session is required to improve the overall download performance. A sketch follows below.
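
A minimal sketch, assuming the directory listings on opendata.dwd.de are plain HTML and lxml is available:

import requests
from lxml import html

session = requests.Session()  # reused connection pool for all requests

def list_remote_files(url):
    """Return the href targets of an opendata.dwd.de directory listing."""
    response = session.get(url)
    response.raise_for_status()
    return html.fromstring(response.content).xpath("//a/@href")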

Packaging and publishing

Introduction

@gutzbenj and I recently talked about introducing appropriate packaging for this fine project and finally pushing the result to PyPI.

Naming things

I found python_dwd a bit quirky, so I proposed a different, more fluid-sounding name for it: "pydwd". This follows the naming scheme of many popular packages on PyPI where the py prefix is used to associate the language thing (Python) with a binding to another thing (here: DWD). As the name is not yet taken on PyPI, we should go for it.

Versioning

When there are no objections, we might want to bump down to 0.1.0 and then progressively increase the version number while we go. Also, properly maintaining the changelog will do no harm.

Release pipeline / Continuous delivery

While I used a combination of Makefile targets together with bumpversion to cut new releases within many other projects from my pen, @gutzbenj aims for integrating the whole thing through GitHub CI / GitHub Actions. While I am not fluid with these things yet, I totally second that approach. Kudos to @meteoDaniel for already starting this matter.

Using modern standards for describing project metadata

Associating metadata with Python projects has a long and more or less bumpy history. It looks like the most modern approach is to deprecate setup.py completely and, after setup.cfg was a thing for some time, to use a pyproject.toml file. Just using requirements.txt is not nearly enough, as it only lists the project's dependencies without covering the other metadata required to describe the project thoroughly.

In order to learn more about that, reading PEP-517, PEP-518 and PEP-621 makes sense. While this is also new to me, we should definitely go for it.

Using modern tools for project management

Notwithstanding the above, I want to note that Poetry has gained a huge momentum these days for managing Python projects and is compliant with PEP-517, see https://python-poetry.org/docs/pyproject/.

Test automation

While I also mentioned Tox for automating testing while talking to Benjamin recently, it looks like Nox is the new kid on the block in this field. We should use it.

The plan

The fine article Hypermodern Python Chapter 6: CI/CD by Claudio Jolowicz thankfully outlines a blueprint for combining GitHub Actions with Poetry and Nox, and illuminates many aspects of that workflow, leaving nothing to wish for.


Summarize attributes into a request object

Several times, var (parameter), res (time_resolution) and per (period_type) are passed to functions. It would make sense to create a dataclass holding these attributes. This way, we can guarantee that all required information is correctly passed to the different functions; see the sketch below.
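
A minimal sketch of such a dataclass; the class name is a placeholder and the annotations refer to the existing enumerations:

from dataclasses import dataclass

@dataclass
class DWDRequest:
    """Bundles the three attributes that are currently passed around separately."""
    parameter: Parameter
    time_resolution: TimeResolution
    period_type: PeriodType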

Fix pathbuilding for ftp client

The FTP client requires a strict format for provided links, which in detail means that separators are slashes (not backslashes!). Therefore the usually used Path is not suitable to build the path; instead we should use PurePosixPath, which always uses regular slashes. This is independent of the operating system in use, as the example below shows.
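
For example (the path segments are illustrative):

from pathlib import PurePosixPath

# PurePosixPath always joins with forward slashes, on any operating system.
remote_path = PurePosixPath("climate", "hourly", "air_temperature", "recent")
print(remote_path)  # climate/hourly/air_temperature/recent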

Update available data

I have found new data:

  • wind_synop: hourly
  • dewpoint: hourly
  • weather_phenomena: daily, monthly, annual

Best regards

Adding tests

We have to add tests for every single function. I would suggest using pytest for this.

Under the tests folder, you copy the project structure. All files named test_{python_script}.py will be collected and run.

A test looks like test_function.py:

from function import function_a

def test_function_a():
    to_test = function_a(1, 2)
    assert to_test == 3

You can use np.testing.assert_array_equal or assert_almost_equal, and plain assert works for lists too.
Pandas testing utilities are also available.

New methods for collecting and restoring data

The current concept of retrieving data has some drawbacks regarding interaction with locally saved data. The main problem with the current set of functions is that at some point we lose information about which data was originally requested. This happens especially in the download function, as it would usually only take the link to a file and return its bytes after downloading. Because the function is set up to produce output that can be further processed by the parsing function, we also need to return additional information, which doesn't really match the purpose of the function.

Another problem with the current setup is that the parsing function is also supposed to interact with the local file, which makes its internals much more complicated.

To improve the situation we need to introduce another function that:

  • represents the common pipeline of data retrieval
    By calling the existing functions, we don't have much work to do. However, we can simplify the individual functions, as we no longer lose information about what the original request consisted of.
  • organizes an alternative data stream from the local file by calling another function
    By introducing another function that takes data from the local file, we can simplify the parsing function considerably. We could first try to take data from the local file and, if unsuccessful, gather data from the online feed.

The above concept should lead to the following new functions:

restore_dwd_data -> interacts with locally stored file
collect_dwd_data -> calls either restore_dwd_data or the original set of functions
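
Sketched with the proposed names; restore_dwd_data, download_and_parse and store_dwd_data are stubs standing in for the real implementations:

def collect_dwd_data(request):
    """Prefer locally stored data; fall back to the online feed."""
    try:
        return restore_dwd_data(request)
    except FileNotFoundError:
        data = download_and_parse(request)  # the original set of functions
        store_dwd_data(request, data)
        return data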

Improve the metadata collection function

The German Weather Service stores its metadata in a poor format that has no separators (neither comma nor tab). Instead of reading the data the way we currently do (row-wise reading and checking the number of chunks, then processing with respect to this chunk number), as in this excerpt from fix_metaindex:

metaindex_to_fix = metaindex.iloc[:, 6:]

# Reduce the original dataframe by those columns
metaindex = metaindex.iloc[:, :6]

# Index is fixed by string operations (put together all except the last
# string which refers to state)
metaindex_to_fix = metaindex_to_fix \
                       .agg(lambda data: [string
                                          for string in data
                                          if string is not None], 1) \
                       .to_frame() \
                       .iloc[:, 0] \
    .agg(lambda data: [' '.join(data[:-1]), data[-1]]) \
    .apply(pd.Series)

# Finally put together again the original frame and the fixed data
metaindex = pd.concat([metaindex, metaindex_to_fix], axis=1)

Instead, we should find another way and read the data directly with pandas using some predefinition, e.g. read_fwf (fixed-width formatted). This should give us a performance boost and also fix some other issues with string handling. A sketch follows below.
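
A sketch of the read_fwf approach; the buffer, skipped rows and column names are assumptions that would need to match the real station list:

import pandas as pd

metaindex = pd.read_fwf(
    station_list_buffer,  # file-like object holding the raw station list
    skiprows=2,           # header line plus separator line
    names=["STATION_ID", "FROM_DATE", "TO_DATE", "HEIGHT",
           "LAT", "LON", "STATION_NAME", "STATE"],
)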

Overhaul column type casting

Hi again,

following up on #107 and #108, I am thinking about converging all of this at the end of the _parse_dwd_data() function:

# Properly handle timestamps from "hourly" resolution, subdaily also has hour in timestamp.
if time_resolution in [TimeResolution.HOURLY, TimeResolution.SUBDAILY]:
    data[DWDMetaColumns.DATE.value] = pd.to_datetime(
        data[DWDMetaColumns.DATE.value], format=DatetimeFormat.YMDH.value)

# Coerce the data types appropriately.
data = data.astype(create_station_data_dtype_mapping(data.columns))

# Coerce Integer data types.
coerce_integer_columns(data)

into a single function

def coerce_column_types(df: pd.DataFrame, time_resolution: TimeResolution)

which would integrate everything from create_station_data_dtype_mapping() and coerce_integer_columns() and take this off the back of _parse_dwd_data(), as sketched below.
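
Sketched out, reusing the names from the snippet above:

def coerce_column_types(df: pd.DataFrame, time_resolution: TimeResolution) -> pd.DataFrame:
    # Hourly and subdaily timestamps carry an hour component.
    if time_resolution in (TimeResolution.HOURLY, TimeResolution.SUBDAILY):
        df[DWDMetaColumns.DATE.value] = pd.to_datetime(
            df[DWDMetaColumns.DATE.value], format=DatetimeFormat.YMDH.value
        )
    # Coerce the data types appropriately, then the Integer columns.
    df = df.astype(create_station_data_dtype_mapping(df.columns))
    coerce_integer_columns(df)
    return df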

What do you think about this?

With kind regards,
Andreas.

Using meaningful column/field names

Dear Benjamin and Daniel,

for starting a discussion around assigning meaningful English names to meteorological short identifiers, I wanted to humbly point out eaaf936 coming from our PR #55.

We had similar things within the knowledgebase module knowledge.py of dwdweather2, and recognized that you also already started putting effort into appropriately mapping DWD-specific original parameters, names and such to identifiers more suitable for human consumption.

So, we would like to ask whether you appreciate the approach of expanding that very aspect to all available field names?

With kind regards,
Andreas.

Cannot import name 'metadata_for_dwd_data' from 'wetterdienst'

Describe the bug
I tried to run the given example. Unfortunately I failed with the following error message:
ImportError: cannot import name 'metadata_for_dwd_data' from 'wetterdienst'.

To Reproduce
Just run the given example.

Desktop (please complete the following information):

  • OS: Mac-OS
  • Python-Version 3.8 (pyenv)

get_nearby_stations only returns closest stations for first passed in coordinate pair

Describe the bug

get_nearby_stations(
    latitudes=[47.996, 54.785], 
    longitudes=[7.852, 9.436], 
    period_type=PeriodType.RECENT, 
    parameter=Parameter.TEMPERATURE_AIR, 
    time_resolution=TimeResolution.HOURLY, 
    max_distance_in_km=50
)

The distances are given for both coordinate pairs. The station ids however only for the first pair.

([13667, 1443, 3702, 1224, 757, 1346],
 [[0.931386122764186, 9.07539313380418],
  [3.6084938122049888, 19.144088575340803],
  [13.372196655985723, 31.254981236882188],
  [15.879028751331843, 37.6572565211981],
  [16.68885088178012, 43.13233288333717],
  [21.592568859612285, 44.26543638304075]])

Expected behavior
Also contain the station ids for which the distances are given.

But to be honest, I find the interface of this method rather quirky. Firstly, I'd propose to pass in a list of coordinate pairs instead of a list of latitudes and longitudes. If I want to lookup the closest stations for a couple of points, it's most likely that I have them stored as coordinate pairs, so it would be much easier to pass them in directly instead of converting them to a list of latitudes and longitudes

get_nearby_stations(
    coordinates=[(47.996, 7.852), (54.785, 9.436)], 
    period_type=PeriodType.RECENT, 
    parameter=Parameter.TEMPERATURE_AIR, 
    time_resolution=TimeResolution.HOURLY, 
    max_distance_in_km=50
)

And in such a case I would probably be interested in the closest stations to the individual points. To not rely on the order (which requires rather complex lookup code), a return type like the following would be easier from a user's perspective:

{
  (47.996, 7.852): [[13667, 1443, ...], [0.93, 3.6, ...]],
  (54.785, 9.436): ...
}

But actually I'd question whether the possibility to request the closest stations for multiple locations is an actual requirement at all. Given that this bug has gone unnoticed so far, I presume it's not been used, at least. If you want to allow fetching the closest stations for multiple locations without repetitive requests, an easy solution could be to wrap this in a class and store the results of the requests to the DWD in an instance variable, e.g.

>>> station_fetcher = StationFetcher()
>>> station_fetcher.get_closest_stations((47.996, 7.852))  # fetches the station list from the DWD server and caches it
[13667, 1443, ...]

>>> station_fetcher.get_closest_stations((54.785, 9.436))  # only accesses the cached data
[...]

Ideally this wouldn't only return the station id, but a more complex object with station metadata, such as the name and location.

I'd be happy to help with an implementation, just let me know

Desktop (please complete the following information):

  • MacOS
  • 3.8
  • wetterdienst 0.5.0

Replace high resolution metadata retrieving process with something proper

The current way of retrieving metadata for precipitation/1_minute/historical includes a part where we had run into an issue with some files and therefore used two try-except blocks, each trying a different engine (c, python) to ensure a working function:

create_metaindex2

        with zip_file.open(file_geo) as file_opened:
            try:
                geo_file = pd.read_csv(filepath_or_buffer=TextIOWrapper(file_opened),
                                       sep=";",
                                       na_values="-999")
            except UnicodeDecodeError:
                geo_file = pd.read_csv(filepath_or_buffer=TextIOWrapper(file_opened),
                                       sep=";",
                                       na_values="-999",
                                       engine="python")

        with zip_file.open(file_par) as file_opened:
            try:
                par_file = pd.read_csv(filepath_or_buffer=TextIOWrapper(file_opened),
                                       sep=";",
                                       na_values="-999")

            except UnicodeDecodeError:
                par_file = pd.read_csv(filepath_or_buffer=TextIOWrapper(file_opened),
                                       sep=";",
                                       na_values="-999",
                                       engine="python")

However, we should try to replace this method with something consistent that uses one engine and, if needed, remodels the incoming bytes/string-like object to work properly.

Add support for forecast information

Introduction

We would like to add support to acquire forecast information from DWD, for both MOSMIX-L and MOSMIX-S.

Background

@drmrbrewer and @onlygecko already asked for this within panodata/dwdweather2#2.

Thoughts

Together with @jlewis91, we already planned to merge dwdweather2 with dwdbulk somehow, but now we should consider to add missing functionality here.

@jlewis91's convert_xml_to_pandas method from forecasts.py might be usable 1:1 already. Note that it was originally written to read MOSMIX-S only.

get_nearby_stations returns abandoned stations

Describe the bug

get_nearby_stations(
    latitudes=[47.996], 
    longitudes=[7.852], 
    period_type=PeriodType.RECENT, 
    parameter=Parameter.TEMPERATURE_AIR, 
    time_resolution=TimeResolution.HOURLY,
    max_distance_in_km=10
)

returns

([13667, 1443], [[0.931386122764186], [3.6084938122049888]])

13667 maps to Freiburg Mitte, which was only operational from 11-2006 to 07-2008 (in fact I think the Freiburg station has moved and the old station has continued operating under another name, but this is unrelated to the issue).

Expected behavior

When requesting recent data, I'd expect to only get data from stations that are still operational. But also if I want to get historic data, I wouldn't expect a station in there that is discontinued for over 10 years, because I might be interested in data from 5 years ago. Two solutions come to mind:

  1. Add a parameter to specify the date at which one is looking for stations. It could default to datetime.now()
  2. Return more than just the ids of the stations but also metadata (name, location, (de)commissioning dates)

In fact I'd propose to implement both of them.

Additional context

List of stations:
https://opendata.dwd.de/climate_environment/CDC/observations_germany/climate/hourly/air_temperature/historical/TU_Stundenwerte_Beschreibung_Stationen.txt
