enram / vptstools


Python library to transfer and convert vertical profile time series data

Home Page: https://enram.github.io/vptstools/

License: MIT License

Languages: Python 99.83%, Dockerfile 0.17%
Topics: aeroecology, oscibio, weather-radar

vptstools's People

Contributors

peterdesmet, pietrh, stijnvanhoey, thejenne18


vptstools's Issues

Provide some more info in AWS notifications

I'm getting daily AWS notifications with:

CLI routine 'vph5_to_vpts --modified-days-ago 2' failed raising error: '<class 'ValueError'>: File name is not a valid ODIM h5 file.'.

It would be helpful to:

  • Know what file failed
  • Where the log files can be found (CloudWatch link); it was decided to keep this part out of the functionality

Improve log handling on server

The current log configuration on the server does not take log rotation into account:

  • add a logrotate config for the paramiko log (see the sketch after this list)
  • convert the current crontab-timestamped redirect (...>> /home/ubuntu/transfer-date +%Y%m%d%H%M%S.log 2>&1) to logging to a single file, and add a logrotate config at the server level.
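
For illustration, a minimal Python-side alternative using a size-rotated handler for the paramiko log (the path and sizes are assumptions, not the current server setup):

import logging
from logging.handlers import RotatingFileHandler

# Rotate the paramiko log at 10 MB, keeping 5 backups (path and sizes are assumptions).
handler = RotatingFileHandler(
    "/home/ubuntu/paramiko.log", maxBytes=10 * 1024 * 1024, backupCount=5
)
handler.setFormatter(logging.Formatter("%(asctime)s %(levelname)s %(name)s: %(message)s"))
logger = logging.getLogger("paramiko")
logger.setLevel(logging.INFO)
logger.addHandler(handler)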

Cannot install dotenv dependency

vph5_to_vpts requires dotenv, but it seems the installation fails:

vph5_to_vpts                                                
Traceback (most recent call last):
  File "/Library/Frameworks/Python.framework/Versions/3.11/bin/vph5_to_vpts", line 5, in <module>
    from vptstools.bin.vph5_to_vpts import cli
  File "/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/vptstools/bin/vph5_to_vpts.py", line 10, in <module>
    from dotenv import load_dotenv
pip3 install dotenv
Collecting dotenv
  Using cached dotenv-0.0.5.tar.gz (2.4 kB)
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Installing backend dependencies ... error
  error: subprocess-exited-with-error
  
  × pip subprocess to install backend dependencies did not run successfully.
  │ exit code: 1
  ╰─> [29 lines of output]
      Collecting distribute
        Using cached distribute-0.7.3.zip (145 kB)
        Installing build dependencies: started
        Installing build dependencies: finished with status 'done'
        Getting requirements to build wheel: started
        Getting requirements to build wheel: finished with status 'done'
        Preparing metadata (pyproject.toml): started
        Preparing metadata (pyproject.toml): finished with status 'error'
        error: subprocess-exited-with-error
      
        × Preparing metadata (pyproject.toml) did not run successfully.
        │ exit code: 1
        ╰─> [6 lines of output]
            usage: setup.py [global_opts] cmd1 [cmd1_opts] [cmd2 [cmd2_opts] ...]
               or: setup.py --help [cmd1 cmd2 ...]
               or: setup.py --help-commands
               or: setup.py cmd --help
      
            error: invalid command 'dist_info'
            [end of output]
      
        note: This error originates from a subprocess, and is likely not a problem with pip.
      error: metadata-generation-failed
      
      × Encountered error while generating package metadata.
      ╰─> See above for output.
      
      note: This is an issue with the package mentioned above, not pip.
      hint: See above for details.
      [end of output]
  
  note: This error originates from a subprocess, and is likely not a problem with pip.
error: subprocess-exited-with-error

× pip subprocess to install backend dependencies did not run successfully.
│ exit code: 1
╰─> See above for output.

note: This error originates from a subprocess, and is likely not a problem with pip.

GitHub workflows

  1. test.yml: run for every commit with one Python version; for PRs, run all Python versions
  2. documentation.yml: for every commit to main, run tests, run Sphinx, push to the gh-pages branch
  3. release.yml: run for tagged commits, run tests for all Python versions, push to PyPI

Documentation build warning

When running tox -e docs, everything works fine, but warnings are raised:

/Users/peter_desmet/Coding/Repositories/enram/vptstools/src/vptstools/__init__.py:docstring of vptstools.vpts.vpts:1: WARNING: duplicate object description of vptstools.vpts, other instance in api/vptstools, use :no-index: for one of them
WARNING: autodoc: failed to import module 'transfer_baltrad' from module 'vptstools.bin'; the following exception was raised:
cannot import name 'report_exception_to_sns' from 'vptstools.bin.click_exception' (/Users/peter_desmet/Coding/Repositories/enram/vptstools/docs/../src/vptstools/bin/click_exception.py)

Cannot create coverage file locally on Mac

  1. tox -e dev
  2. source venv/bin/activate
  3. tox

Tests work fine until 98%, after which I get an error:

tests/test_vpts_csv.py::TestVptsCsvV1SupportFun::test_check_source_file PASSED                                                                                            [ 98%]
/Users/peter_desmet/Coding/Repositories/enram/vptstools/.tox/py39/lib/python3.9/site-packages/coverage/data.py:166: CoverageWarning: Couldn't use data file '/Users/peter_desmet/Coding/Repositories/enram/vptstools/.coverage.Peters-MacBook-Air.local.50554.789008-journal': file is not a database
  data._warn(str(exc))
tests/test_vpts_csv.py::TestVptsCsvV1SupportFun::test_check_source_file_wrong_file PASSED                                                                                 [100%]
INTERNALERROR> Traceback (most recent call last):
INTERNALERROR>   File "/Users/peter_desmet/Coding/Repositories/enram/vptstools/.tox/py39/lib/python3.9/site-packages/_pytest/main.py", line 270, in wrap_session
INTERNALERROR>     session.exitstatus = doit(config, session) or 0
INTERNALERROR>   File "/Users/peter_desmet/Coding/Repositories/enram/vptstools/.tox/py39/lib/python3.9/site-packages/_pytest/main.py", line 324, in _main
INTERNALERROR>     config.hook.pytest_runtestloop(session=session)
INTERNALERROR>   File "/Users/peter_desmet/Coding/Repositories/enram/vptstools/.tox/py39/lib/python3.9/site-packages/pluggy/_hooks.py", line 265, in __call__
INTERNALERROR>     return self._hookexec(self.name, self.get_hookimpls(), kwargs, firstresult)
INTERNALERROR>   File "/Users/peter_desmet/Coding/Repositories/enram/vptstools/.tox/py39/lib/python3.9/site-packages/pluggy/_manager.py", line 80, in _hookexec
INTERNALERROR>     return self._inner_hookexec(hook_name, methods, kwargs, firstresult)
INTERNALERROR>   File "/Users/peter_desmet/Coding/Repositories/enram/vptstools/.tox/py39/lib/python3.9/site-packages/pluggy/_callers.py", line 55, in _multicall
INTERNALERROR>     gen.send(outcome)
INTERNALERROR>   File "/Users/peter_desmet/Coding/Repositories/enram/vptstools/.tox/py39/lib/python3.9/site-packages/pytest_cov/plugin.py", line 297, in pytest_runtestloop
INTERNALERROR>     self.cov_controller.finish()
INTERNALERROR>   File "/Users/peter_desmet/Coding/Repositories/enram/vptstools/.tox/py39/lib/python3.9/site-packages/pytest_cov/engine.py", line 44, in ensure_topdir_wrapper
INTERNALERROR>     return meth(self, *args, **kwargs)
INTERNALERROR>   File "/Users/peter_desmet/Coding/Repositories/enram/vptstools/.tox/py39/lib/python3.9/site-packages/pytest_cov/engine.py", line 242, in finish
INTERNALERROR>     self.cov.stop()
INTERNALERROR>   File "/Users/peter_desmet/Coding/Repositories/enram/vptstools/.tox/py39/lib/python3.9/site-packages/coverage/control.py", line 807, in combine
INTERNALERROR>     combine_parallel_data(
INTERNALERROR>   File "/Users/peter_desmet/Coding/Repositories/enram/vptstools/.tox/py39/lib/python3.9/site-packages/coverage/data.py", line 148, in combine_parallel_data
INTERNALERROR>     with open(f, "rb") as fobj:
INTERNALERROR> FileNotFoundError: [Errno 2] No such file or directory: '/Users/peter_desmet/Coding/Repositories/enram/vptstools/.coverage.Peters-MacBook-Air.local.50618.245710-journal'

Comparison of hdf5/daily/monthly files

HDF5

library(dplyr)
library(readr)
library(bioRad)
files <- list.files("~/Downloads/bejab/aloft/hdf5", full.names = TRUE)
vp <- bioRad::read_vpfiles(files)
vpts <-
  bioRad::bind_into_vpts(vp) %>%
  as.data.frame(geo = TRUE, suntime = FALSE) %>%
  dplyr::arrange(datetime, height)
readr::write_csv(vpts, "vpts.csv") # This converts NaN to NA, I have manually set those back to NaN below
nrow(vpts)
# 35575
radar datetime ff dbz dens u v gap w n_dbz dd n DBZH height n_dbz_all eta sd_vvp n_all lat lon height_antenna
bejab 2023-02-02T00:00:00Z 0 NaN -2.578521966934204 18.094850540161133 NaN NaN 1 NaN 2237 NaN 227 25.863487243652344 8821 199.04335021972656 2.0527188777923584 436 51.191700000000004 3.0642000000000005 50
bejab 2023-02-02T00:00:00Z 0 NaN -3.3399300575256348 15.184979438781738 NaN NaN 1 NaN 1568 NaN 163 24.80197525024414 8828 167.03477478027344 2.5198092460632324 456 51.191700000000004 3.0642000000000005 50
bejab 2023-02-02T00:00:00Z 0 NaN -3.988534688949585 13.078372955322266 NaN NaN 1 NaN 2122 NaN 223 23.728849411010742 8862 143.8621063232422 2.4304895401000977 444 51.191700000000004 3.0642000000000005 50
bejab 2023-02-02T00:00:00Z 200 3.0408787727355957 -7.179384708404541 6.272905349731445 2.805347204208374 -1.1734440326690674 0 -21.234346389770508 19771 112.69898986816406 740 16.758867263793945 22850 69.00196075439453 2.9160335063934326 873 51.191700000000004 3.0642000000000005 50
bejab 2023-02-02T00:00:00Z 200 3.2996182441711426 -7.66164493560791 5.613616466522217 2.8031976222991943 -1.740564227104187 0 5.670511245727539 18076 121.8370132446289 568 18.90127182006836 22860 61.749778747558594 2.4480631351470947 866 51.191700000000004 3.0642000000000005 50

Daily

d1 <- readr::read_csv("~/Downloads/aloft/bejab/bejab_vpts_20230202.csv")
d2 <- readr::read_csv("~/Downloads/aloft/bejab/bejab_vpts_20230203.csv")
d3 <- readr::read_csv("~/Downloads/aloft/bejab/bejab_vpts_20230213.csv")
d4 <- readr::read_csv("~/Downloads/aloft/bejab/bejab_vpts_20230214.csv")
d5 <- readr::read_csv("~/Downloads/aloft/bejab/bejab_vpts_20230215.csv")
daily <- bind_rows(d1, d2, d3, d4, d5)
radar datetime height u v w ff dd sd_vvp gap eta dens dbz dbz_all n n_dbz n_all n_dbz_all rcs sd_vvp_threshold vcp radar_latitude radar_longitude radar_height radar_wavelength source_file
bejab 2023-02-02T00:00:00Z 0 NaN NaN NaN NaN NaN 2.0527188777923584 TRUE 199.04335021972656 18.094850540161133 -2.578521966934204 25.863487243652344 227 2237 436 8821 11.0 2.0 51.1917 3.0642 50 5.3 bejab_vp_20230202T000000Z_0x9.h5
bejab 2023-02-02T00:00:00Z 0 NaN NaN NaN NaN NaN 2.5198092460632324 TRUE 167.03477478027344 15.184979438781738 -3.3399300575256348 24.80197525024414 163 1568 456 8828 11.0 2.0 51.1917 3.0642 50 5.3 bejab_vp_20230202T000500Z_0x9.h5
bejab 2023-02-02T00:00:00Z 0 NaN NaN NaN NaN NaN 2.4304895401000977 TRUE 143.8621063232422 13.078372955322266 -3.988534688949585 23.728849411010742 223 2122 444 8862 11.0 2.0 51.1917 3.0642 50 5.3 bejab_vp_20230202T001000Z_0x9.h5
bejab 2023-02-02T00:00:00Z 200 2.805347204208374 -1.1734440326690674 -21.234346389770508 3.0408787727355957 112.69898986816406 2.9160335063934326 FALSE 69.00196075439453 6.272905349731445 -7.179384708404541 16.758867263793945 740 19771 873 22850 11.0 2.0 51.1917 3.0642 50 5.3 bejab_vp_20230202T000000Z_0x9.h5
bejab 2023-02-02T00:00:00Z 200 2.8031976222991943 -1.740564227104187 5.670511245727539 3.2996182441711426 121.8370132446289 2.4480631351470947 FALSE 61.749778747558594 5.613616466522217 -7.66164493560791 18.90127182006836 568 18076 866 22860 11.0 2.0 51.1917 3.0642 50 5.3 bejab_vp_20230202T000500Z_0x9.h5

Monthly

monthly <- readr::read_csv("~/Downloads/aloft/bejab_vpts_202302.csv.gz")
radar datetime height u v w ff dd sd_vvp gap eta dens dbz dbz_all n n_dbz n_all n_dbz_all rcs sd_vvp_threshold vcp radar_latitude radar_longitude radar_height radar_wavelength source_file
bejab 2023-02-02T00:00:00Z 0 2.0527188777923584 TRUE 199.04335021972656 18.094850540161133 -2.578521966934204 25.863487243652344 227 2237 436 8821 11.0 2.0 51.1917 3.0642 50 5.3 bejab_vp_20230202T000000Z_0x9.h5
bejab 2023-02-02T00:00:00Z 0 2.5198092460632324 TRUE 167.03477478027344 15.184979438781738 -3.3399300575256348 24.80197525024414 163 1568 456 8828 11.0 2.0 51.1917 3.0642 50 5.3 bejab_vp_20230202T000500Z_0x9.h5
bejab 2023-02-02T00:00:00Z 0 2.4304895401000977 TRUE 143.8621063232422 13.078372955322266 -3.988534688949585 23.728849411010742 223 2122 444 8862 11.0 2.0 51.1917 3.0642 50 5.3 bejab_vp_20230202T001000Z_0x9.h5
bejab 2023-02-02T00:00:00Z 200 2.805347204208374 -1.1734440326690674 -21.234346389770508 3.0408787727355957 112.69898986816406 2.9160335063934326 FALSE 69.00196075439453 6.272905349731445 -7.179384708404541 16.758867263793945 740 19771 873 22850 11.0 2.0 51.1917 3.0642 50 5.3 bejab_vp_20230202T000000Z_0x9.h5
bejab 2023-02-02T00:00:00Z 200 2.8031976222991943 -1.740564227104187 5.670511245727539 3.2996182441711426 121.8370132446289 2.4480631351470947 FALSE 61.749778747558594 5.613616466522217 -7.66164493560791 18.90127182006836 568 18076 866 22860 11.0 2.0 51.1917 3.0642 50 5.3 bejab_vp_20230202T000500Z_0x9.h5
testthat::expect_equal(daily, monthly)

Refactor s3fs (and the aotbotocore) out as a dependency

s3fs is a convenient package to interact with S3 (it feels like coding against a normal file system), but since s3fs relies on the async aiobotocore package, some issues arose:

  • due to very strict dependency pins in both boto3 and aiobotocore, the pip solver was not able to resolve this unless boto3 was defined as an additional install from s3fs (see daeb285 for the fix)
  • aiobotocore requires raw_headers, which are not according to spec in the moto library used in the testing (see getmoto/moto#3259 for a description and 802b9f5 for the fix)
  • aiobotocore/s3fs get stuck in a recursive loop copying the files in the pytest context, leading to an ever-increasing file size when running the tests. This has been solved by relying on boto3 instead for the download, see 28a5e42

Hence, these issues have been handled in https://github.com/enram/vptstools/tree/SVH-country-filter, but I'm not sure the convenience of s3fs is worth having it as a dependency. Excluding s3fs might make things easier to maintain.
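
For reference, a minimal sketch of a plain boto3 download replacing the s3fs call (not the exact change in 28a5e42; the bucket and key are examples taken from this repository):

import boto3

s3 = boto3.client("s3")
# Download a single h5 file from the aloft bucket to a local path;
# this avoids pulling in s3fs/aiobotocore for the download step.
s3.download_file(
    "aloft",
    "baltrad/hdf5/bejab/2023/02/02/bejab_vp_20230202T000000Z_0x9.h5",
    "bejab_vp_20230202T000000Z_0x9.h5",
)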

Provide a full rerun functionality

We need to be able to remove all CSV files (daily/monthly) in the s3 aloft bucket and replace them with a new run using a given schema-version implementation.

This needs to be operational so a rerun can happen after enram/vpts-csv#42 is merged.

Investigate MyPy issues

I'd like to (sooner than later) use MyPy to make this package more robust.

A first try gives me error: Skipping analyzing "odimh5.reader": found module but no type hints or library stubs, which seems weird since the odimh5 package is type-annotated.

To investigate.

Would be useful if coverage.csv also provided info on daily/monthly

The coverage.csv only provides coverage for the hdf5 portion of each source (baltrad, uva, ecog). It would be useful if it also provided the coverage for the daily and monthly portions of each source. That is not a trivial change, however, since the coverage is based on the AWS inventory and that inventory is limited to the hdf5 files (because that is the part that matters for vph5_to_vpts).

Where to drop csv files?

The directory consensus for files (enram/data-repository#65 (comment)) is source/format/radar/yyyy/

I suggest:

# source data
baltrad/hdf5/radar/yyyy/mm/dd/file.h5

# daily unzipped csv
baltrad/daily/radar/yyyy/file.csv

# monthly gzipped csv
baltrad/monthly/radar/yyyy/file.csv.gz

Investigate `vcp` error

Logs for the entire bucket indicate:

September 06, 2023 at 12:34 (UTC+2:00)[WARNING] - During conversion from HDF5 files of baltrad/bejab at 2018-06-03 to daily VPTS file, the following error occurred: 'vcp'.
sync
September 06, 2023 at 12:33 (UTC+2:00)Create daily VPTS file baltrad/daily/bejab/2018/bejab_vpts_20180603.csv.
sync

What is the vcp error?

Cannot reproduce second README example

Using these files, which I saved in data:

https://aloftdata.s3-eu-west-1.amazonaws.com/baltrad/hdf5/nldbl/2013/11/23/nldbl_vp_20131123T0000Z.h5
https://aloftdata.s3-eu-west-1.amazonaws.com/baltrad/hdf5/nldbl/2013/11/23/nldbl_vp_20131123T0015Z.h5
https://aloftdata.s3-eu-west-1.amazonaws.com/baltrad/hdf5/nldbl/2013/11/23/nldbl_vp_20131123T0030Z.h5

I get an error when I try to reproduce the second README example:

from pathlib import Path
from vptstools.vpts import vpts

file_paths = sorted(Path("./data").rglob("*.h5"))  # Get all h5 files within the data directory
df_vpts = vpts(file_paths)
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/multiprocessing/spawn.py", line 120, in spawn_main
    exitcode = _main(fd, parent_sentinel)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/multiprocessing/spawn.py", line 129, in _main
    prepare(preparation_data)
  File "/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/multiprocessing/spawn.py", line 240, in prepare
    _fixup_main_from_path(data['init_main_from_path'])
  File "/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/multiprocessing/spawn.py", line 291, in _fixup_main_from_path
    main_content = runpy.run_path(main_path,
                   ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "<frozen runpy>", line 291, in run_path
  File "<frozen runpy>", line 98, in _run_module_code
  File "<frozen runpy>", line 88, in _run_code
  File "/Users/peter_desmet/Coding/Repositories/enram/vptstools/test.py", line 5, in <module>
    df_vpts = vpts(file_paths)
              ^^^^^^^^^^^^^^^^
  File "/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/vptstools/vpts.py", line 256, in vpts
    with multiprocessing.Pool(processes=cpu_count) as pool:
         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/multiprocessing/context.py", line 119, in Pool
    return Pool(processes, initializer, initargs, maxtasksperchild,
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/multiprocessing/pool.py", line 215, in __init__
    self._repopulate_pool()
  File "/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/multiprocessing/pool.py", line 306, in _repopulate_pool
    return self._repopulate_pool_static(self._ctx, self.Process,
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

If I print file_paths I get:

[PosixPath('data/nldbl_vp_20131123T0000Z.h5'), PosixPath('data/nldbl_vp_20131123T0015Z.h5'), PosixPath('data/nldbl_vp_20131123T0030Z.h5')]
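
The traceback goes through multiprocessing/spawn.py, which suggests the macOS spawn start method is re-importing the script when vpts() opens its process pool. Guarding the top-level call might avoid it (a sketch of the workaround, not a confirmed fix):

from pathlib import Path
from vptstools.vpts import vpts

if __name__ == "__main__":
    # On macOS the spawn start method re-imports this module in each worker,
    # so the top-level call must be guarded.
    file_paths = sorted(Path("./data").rglob("*.h5"))
    df_vpts = vpts(file_paths)
    print(df_vpts)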

Datetime in h5 filename does not correspond to h5 content (/what/time, "HHmmss")

I downloaded a set of files from the bejab data as a test case and, while trying out the CSV concatenation (to create a vpts-csv), I encountered repeated timestamps for multiple files, not corresponding to the timestamp included in the file name:

FILENAME                         /what/time               
bejab_vp_20221111T233000Z_0x9.h5 233000
bejab_vp_20221111T234000Z_0x9.h5 233000
bejab_vp_20221111T234500Z_0x9.h5 234500
bejab_vp_20221111T235000Z_0x9.h5 234500
bejab_vp_20221111T235500Z_0x9.h5 234500

To check, I downloaded some files directly from the Baltrad sftp and compared the timestamp in the file name with the timestamp in /what/time, which revealed several of these differences (67% in a quick test on 50 files):

FILE    WHAT/TIME   FILEPATH
2250    2245        bejab_vp_20221112T225000Z_0x9.h5
0235    0230        bewid_vp_20221113T023500Z_0xb.h5
1635    1630        chppm_vp_20221114T163500Z_0xb.h5
0310    0300        dedrs_vp_20221115T031000Z_0xb.h5
0105    0100        defbg_vp_20221114T010500Z_0xb.h5
1025    1015        deisn_vp_20221115T102500Z_0xb.h5
0125    0115        denhb_vp_20221114T012500Z_0xb.h5
0505    0500        denhb_vp_20221114T050500Z_0xb.h5
1210    1200        eehar_vp_20221113T121000Z_0xb.h5
0410    0400        eehar_vp_20221114T041000Z_0xb.h5
0520    0515        esalm_vp_20221114T052000Z_0xb.h5
1410    1400        esbar_vp_20221113T141000Z_0xb.h5
1420    1415        essse_vp_20221114T142000Z_0xb.h5
1040    1030        esval_vp_20221115T104000Z_0xb.h5
0150    0145        filuo_vp_20221114T015000Z_0xb.h5
0255    0245        finur_vp_20221114T025500Z_0xb.h5
1440    1430        frabb_vp_20221114T144000Z_0xb.h5
0050    0045        frcol_vp_20221115T005000Z_0xb.h5
1835    1830        frmcl_vp_20221114T183500Z_0xb.h5
1340    1330        frmom_vp_20221114T134000Z_0xb.h5
2050    2045        frnim_vp_20221113T205000Z_0xb.h5
0320    0315        frniz_vp_20221113T032000Z_0xb.h5
0640    0630        frtou_vp_20221113T064000Z_0xb.h5
2250    2245        frtra_vp_20221114T225000Z_0xb.h5
0825    0815        frtre_vp_20221113T082500Z_0xb.h5
0605    0600        nohgb_vp_20221115T060500Z_0xb.h5
1555    1545        nosmn_vp_20221113T155500Z_0xb.h5
0020    0015        plram_vp_20221114T002000Z_0xb.h5
2205    2200        sekaa_vp_20221113T220500Z_0xb.h5
0210    0200        sevax_vp_20221113T021000Z_0xb.h5

@peterdesmet is this a known issue or am I stuck on a bug I just can't get around? For the latter experiment I relied only on the h5py package as a dependency (I left out the vptstools modules and just tried to extract the timestamps):

from pathlib import Path

import h5py

file_paths = sorted(Path("../data/raw/baltrad/").rglob("*.h5"))

for path_h5 in file_paths:
    with h5py.File(path_h5, mode="r") as odim_vp:
        # HHmm from the file name, e.g. bejab_vp_20221112T225000Z_0x9.h5 -> "2250"
        time_filename = path_h5.stem.split("_")[2][9:13]
        # HHmm from the /what/time attribute (stored as "HHmmss", seconds dropped)
        time_h5_what = odim_vp["what"].attrs.get("time").decode("utf-8")[:-2]
        if time_filename != time_h5_what:
            print(time_filename, time_h5_what, path_h5)

The time difference might not be an issue if the timestamps are unique among the different files. Or should we rather use the timestamp from the file path of the h5 files?

Run with `--modified-days-ago 0` failed silently

@TheJenne18 ran a full processing with --modified-days-ago 0, using 8 vCPUs and 16 GB. The job stopped silently, with no data added to the bucket. The only line in the log file is:

Recreate the full set of bucket files (files modified since 401days). This will take a while!

I assume that is just the start of the process. Did it fail reading the full inventory into memory? It might be useful to provide more messages, so we know at what point the processing failed.

How to handle the gain/offset in the conversion from hdf5 to vpts-csv

The https://github.com/adokter/vol2bird/wiki/ODIM-bird-profile-format-specification#specification-of-bird-profile-output-in-odim-hdf5-format specification defines a gain and an offset for the datasets/variables. The current implementation of the conversion from h5 to VPTS CSV does not take these into account. @adokter, should this actually be done by default, storing quantity*gain+offset for each record in the vpts-csv instead of the raw quantity?
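
For illustration, a minimal h5py sketch of applying gain/offset following the ODIM convention value = raw * gain + offset (the group layout dataset1/data1 is an assumption; this is not the vptstools implementation):

import h5py

def read_scaled(path_h5, group="dataset1/data1"):
    """Return the physical values of one profile quantity, applying gain/offset."""
    with h5py.File(path_h5, mode="r") as odim_vp:
        raw = odim_vp[f"{group}/data"][()]
        attrs = odim_vp[f"{group}/what"].attrs
        gain = attrs.get("gain", 1.0)
        offset = attrs.get("offset", 0.0)
        # ODIM convention: physical value = raw * gain + offset
        return raw * gain + offset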

Incorrect author in PyPi

https://github.com/enram/vptstools/blob/main/setup.cfg has been updated to list INBO as the author. When I install from PyPI, I still see the old information:

pip3 show vptstools
Name: vptstools
Version: 0.2.2
Summary: Tools to work with vertical profile time series.
Home-page: https://enram.github.io/vptstools/
Author: enram
Author-email: 
License: MIT
Location: /Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages
Requires: click, frictionless, h5py, pandas, pytz
Required-by: 

How can an update be forced?

Manage tags/releases

I notice we have many tags and 1 release. @stijnvanhoey:

  1. Is a tag sufficient to have it picked up by GitHub Actions and used in the operational pipeline? Or does it require a release?
  2. Is a tag sufficient to have it published to PyPI? Yes, see https://pypi.org/project/vptstools/
  3. Should we create releases for the most recent tags? Maybe only for the minor versions (not hotfixes).
  4. Should we clean up the tags? There seems to be a mix of usage. Here's my suggestion:
  • v0.2.2: keep
  • v0.2.1: keep
  • v0.2.0: keep
  • 0.1.0a14: delete
  • 0.1.0a13: delete
  • v0.1.0: keep, is release
  • 0.1.0a12: delete
  • 0.1.0a11: delete
  • 0.1.0a10: delete
  • 0.1.0a9: delete
  • 0.1.0a8: delete
  • 0.1.0a7: delete
  • 0.1.0a6: delete
  • 0.1.0a5: delete
  • 0.1.0a4: delete
  • 0.1.0a3: delete
  • 0.1.0a2: delete
  • 0.1.0a1: delete

Add 'source' column

See enram/vpts-csv#42; implementation-wise:

  • The data record should be the full s3 path: s3://aloft/baltrad...
  • Added as last column in the mapping
  • Added to the sorting: dict(radar=str, datetime=str, height=int, source_file=str) (see the sketch below)
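
For illustration, a rough pandas sketch of that sorting step (the dataframe content and s3 paths are illustrative examples; column names follow the dict above):

import pandas as pd

df = pd.DataFrame({
    "radar": ["bejab", "bejab"],
    "datetime": ["2023-02-02T00:05:00Z", "2023-02-02T00:00:00Z"],
    "height": [0, 0],
    "source_file": [
        "s3://aloft/baltrad/hdf5/bejab/2023/02/02/bejab_vp_20230202T000500Z_0x9.h5",
        "s3://aloft/baltrad/hdf5/bejab/2023/02/02/bejab_vp_20230202T000000Z_0x9.h5",
    ],
})
# Cast to the sort types from the dict above, then sort with source_file as the last key.
df = df.astype({"radar": str, "datetime": str, "height": int, "source_file": str})
df = df.sort_values(by=["radar", "datetime", "height", "source_file"], ignore_index=True)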

Flow for converting h5 to VPTS CSV files

Pseudo code:

import pandas as pd

h5_files = get_h5_files(radar, start, end, source)  # returns a list of file paths
dfs = [h5_to_df(h5_file) for h5_file in h5_files]   # one dataframe per h5 file
df = pd.concat(dfs, ignore_index=True)
df.to_csv("some/path/name.csv", index=False)

So:

  1. a custom function get_h5_files() that understands the directory structure of the repo. It likely uses the s3 library under the hood to get a list of file paths that match the radar, start date, end date and source criteria (see the sketch after this list).

  2. a custom function h5_to_df() that reads an h5 file and converts it to the VPTS CSV format, but as a dataframe, not a file. The function can be called many times to build a growing data frame.

  3. a generic write_csv() function (e.g. from pandas) that writes the df to a file at some location. The write_csv() settings should match those of the CSV dialect defined for VPTS CSV.
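
A hypothetical sketch of get_h5_files(), assuming boto3 for the listing and YYYYMMDD strings for start/end (the key layout follows the source/hdf5/radar/yyyy/mm/dd/ convention used in the bucket; the bucket name is an example):

import boto3

def get_h5_files(radar, start, end, source, bucket="aloft"):
    """List s3 keys of h5 files for one radar/source between two YYYYMMDD dates."""
    s3 = boto3.client("s3")
    paginator = s3.get_paginator("list_objects_v2")
    keys = []
    for page in paginator.paginate(Bucket=bucket, Prefix=f"{source}/hdf5/{radar}/"):
        for obj in page.get("Contents", []):
            key = obj["Key"]
            # Keys look like source/hdf5/radar/yyyy/mm/dd/<file>.h5
            date = "".join(key.split("/")[3:6])
            if key.endswith(".h5") and start <= date <= end:
                keys.append(key)
    return keys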

Update the CI setup

Points to cover:

  • check the Python versions for which we want the CI to run (which server is it deployed on?)
  • pip-tools integration to provide a deployable version with a fixed dependency set (+ CI logic)
  • deployment of the Sphinx website and to PyPI (key setup, ...)
  • CI logic for the lint/formatting checks so that they raise errors on messages

AWS error references deleted file

I got an AWS notification email today:

CLI routine 'vph5_to_vpts --modified-days-ago 2' failed raising error: '<class 'ValueError'>: File name uva/hdf5/dbl/2008/02/17/nldbl_vp_20080217t0000_nl50_v0-3-20.h5 is not a valid ODIM h5 file.'.

It now includes the name of the file 👍. The mentioned file is however no longer in the repository. It was deleted August 21 or 22 (2 or 1 days ago). I'll see tomorrow if the issue resolves itself, i.e. the inventory is updated and the file is no longer listed there and no notification is generated.

@stijnvanhoey @TheJenne18 can deleted files linger in the inventory? Should this resolve itself automatically? Not sure we considered this when designing the architecture.

Note this error did not stop the creation of the daily and monthly files 👍

Provide environment in exception messages

Suggestion by @TheJenne18: to avoid confusion about which environment notifications are sent from (#62), it might be useful to include the environment (os.environ["ENV"]) in error messages such as:

CLI routine 'vph5_to_vpts --modified-days-ago 2' failed raising error: '<class 'ValueError'>: File name uva/hdf5/dbl/2008/02/17/nldbl_vp_20080217t0000_nl50_v0-3-20.h5 is not a valid ODIM HDF5 file.'.

Sample data for unit testing

@niconoe when running the current unit tests, there is a reference to sample data which is not available in the repository. Is there a reference or documentation available on the example data setup? Should I just use 'any' h5 file to make the test_error_non_vp_source_file test work (as this one should fail)?

Add reproducible example in README

Add a section Usage with a simple reproducible example (Python or command-line code) to show how three h5 files (downloaded from aloft) can be converted to VPTS CSV. Tackle in #50.

Improve upload handling

In order to speed up uploads to S3, handling multiple files at the same time would be a huge improvement.
A first option would be working with async, but as the boto3 library does not yet support async handling, this approach will not work. Working with multiple threads or parallel processes would be a valid option to implement.
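
A minimal sketch of the threaded option with boto3 and concurrent.futures (the bucket name and the (path, key) input format are assumptions, not the current upload code):

from concurrent.futures import ThreadPoolExecutor

import boto3

s3 = boto3.client("s3")

def upload_one(path, key, bucket="aloft"):
    # boto3 clients are thread-safe, so one client can be shared across workers.
    s3.upload_file(str(path), bucket, key)

def upload_many(pairs, max_workers=8):
    """pairs: iterable of (local_path, s3_key) tuples."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [pool.submit(upload_one, path, key) for path, key in pairs]
        for future in futures:
            future.result()  # re-raise any upload error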

Validating generated csv files

@stijnvanhoey CSV files can be validated with:

  1. Create a datapackage.json file with the following content:
{
  "profile": "tabular-data-package",
  "resources": [
    {
      "name": "vpts",
      "path": "vpts.csv",
      "profile": "tabular-data-resource",
      "format": "csv",
      "mediatype": "text/csv",
      "encoding": "utf-8",
      "schema": "https://raw.githubusercontent.com/enram/vpts-csv/main/vpts-csv-table-schema.json"
    }
  ]
}
  2. Place the datapackage.json file in the same directory as your CSV file. Rename the path value if necessary to point to the CSV file (named vpts.csv above).
  3. Install https://github.com/frictionlessdata/frictionless-py (I'm using v4.40.0).
  4. Run in the CLI: frictionless validate datapackage.json
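
For reference, the same validation can also be run from Python with frictionless-py v4 (a sketch matching the CLI call above):

from frictionless import validate

# Validate the data package (and thus the CSV) against the VPTS CSV table schema.
report = validate("datapackage.json")
print(report.valid)
print(report.flatten(["rowPosition", "fieldPosition", "code", "message"]))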

Move and adjust the s3 inventory

  • create a new bucket "aloft-inventory" and move the inventory there
  • use the CSV inventory (not the parquet one)
  • add functionality to remove old inventory files (from manifest -> remove files)
  • the inventory needs to be applied to all 'sources' (baltrad, ecol, ...)
  • the file name is used as the single source of truth, and the file name <-> s3-path logic is a function that can be passed (injected) into the routine (see the sketch after this list)
  • assign the ec2 engine rights to access the aloft-inventory bucket
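
A hypothetical example of such an injectable file name <-> s3-path function (the function name is illustrative; the path layout follows the source/hdf5/radar/yyyy/mm/dd/ convention used in the bucket):

def h5_path_from_filename(file_name, source="baltrad"):
    """Map e.g. bejab_vp_20230202T000000Z_0x9.h5 to baltrad/hdf5/bejab/2023/02/02/<file_name>."""
    radar = file_name.split("_")[0]
    date = file_name.split("_")[2][:8]  # yyyymmdd part of the timestamp
    return f"{source}/hdf5/{radar}/{date[:4]}/{date[4:6]}/{date[6:8]}/{file_name}"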
