enram / vptstools


Python library to transfer and convert vertical profile time series data

Home Page: https://enram.github.io/vptstools/

License: MIT License

Languages: Python 99.83%, Dockerfile 0.17%
Topics: aeroecology, oscibio, weather-radar

vptstools's People

Contributors

peterdesmet, pietrh, stijnvanhoey, thejenne18


vptstools's Issues

Provide some more info in AWS notifications

I'm getting daily AWS notifications with:

CLI routine 'vph5_to_vpts --modified-days-ago 2' failed raising error: '<class 'ValueError'>: File name is not a valid ODIM h5 file.'.

It would be helpful to:

  • Know what file failed
  • Where the log files can be found (CloudWatch link); it was decided to keep this part out of the functionality

Improve log handling on server

The current log configuration on the server does not take log rotation into account:

  • add a logrotate config for the paramiko log (see the sketch after this list)
  • convert the current crontab-timestamped redirect (...>> /home/ubuntu/transfer-date +%Y%m%d%H%M%S.log 2>&1) to logging to a single file, and add a logrotate config at the server level.
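
For illustration, a minimal Python-side alternative using a size-rotated handler for the paramiko log (the path and sizes are assumptions, not the current server setup):

import logging
from logging.handlers import RotatingFileHandler

# Rotate the paramiko log at 10 MB, keeping 5 backups (path and sizes are assumptions).
handler = RotatingFileHandler(
    "/home/ubuntu/paramiko.log", maxBytes=10 * 1024 * 1024, backupCount=5
)
handler.setFormatter(logging.Formatter("%(asctime)s %(levelname)s %(name)s: %(message)s"))
logger = logging.getLogger("paramiko")
logger.setLevel(logging.INFO)
logger.addHandler(handler)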

Cannot install dotenv dependency

vph5_to_vpts requires dotenv, but it seems the installation fails:

vph5_to_vpts                                                
Traceback (most recent call last):
  File "/Library/Frameworks/Python.framework/Versions/3.11/bin/vph5_to_vpts", line 5, in <module>
    from vptstools.bin.vph5_to_vpts import cli
  File "/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/vptstools/bin/vph5_to_vpts.py", line 10, in <module>
    from dotenv import load_dotenv
pip3 install dotenv
Collecting dotenv
  Using cached dotenv-0.0.5.tar.gz (2.4 kB)
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Installing backend dependencies ... error
  error: subprocess-exited-with-error
  
  × pip subprocess to install backend dependencies did not run successfully.
  │ exit code: 1
  ╰─> [29 lines of output]
      Collecting distribute
        Using cached distribute-0.7.3.zip (145 kB)
        Installing build dependencies: started
        Installing build dependencies: finished with status 'done'
        Getting requirements to build wheel: started
        Getting requirements to build wheel: finished with status 'done'
        Preparing metadata (pyproject.toml): started
        Preparing metadata (pyproject.toml): finished with status 'error'
        error: subprocess-exited-with-error
      
        × Preparing metadata (pyproject.toml) did not run successfully.
        │ exit code: 1
        ╰─> [6 lines of output]
            usage: setup.py [global_opts] cmd1 [cmd1_opts] [cmd2 [cmd2_opts] ...]
               or: setup.py --help [cmd1 cmd2 ...]
               or: setup.py --help-commands
               or: setup.py cmd --help
      
            error: invalid command 'dist_info'
            [end of output]
      
        note: This error originates from a subprocess, and is likely not a problem with pip.
      error: metadata-generation-failed
      
      × Encountered error while generating package metadata.
      ╰─> See above for output.
      
      note: This is an issue with the package mentioned above, not pip.
      hint: See above for details.
      [end of output]
  
  note: This error originates from a subprocess, and is likely not a problem with pip.
error: subprocess-exited-with-error

× pip subprocess to install backend dependencies did not run successfully.
│ exit code: 1
╰─> See above for output.

note: This error originates from a subprocess, and is likely not a problem with pip.

GitHub workflows

  1. test.yml: run for every commit with one Python version; for PRs, run all Python versions
  2. documentation.yml: for every commit to main, run tests, run Sphinx, push to the gh-pages branch
  3. release.yml: run for tagged commits, run tests for all Python versions, push to PyPI

Documentation build warning

When running tox -e docs, everything works fine, but warnings are raised:

/Users/peter_desmet/Coding/Repositories/enram/vptstools/src/vptstools/__init__.py:docstring of vptstools.vpts.vpts:1: WARNING: duplicate object description of vptstools.vpts, other instance in api/vptstools, use :no-index: for one of them
WARNING: autodoc: failed to import module 'transfer_baltrad' from module 'vptstools.bin'; the following exception was raised:
cannot import name 'report_exception_to_sns' from 'vptstools.bin.click_exception' (/Users/peter_desmet/Coding/Repositories/enram/vptstools/docs/../src/vptstools/bin/click_exception.py)

Cannot create coverage file locally on Mac

  1. tox -e dev
  2. source venv/bin/activate
  3. tox

Tests work fine until 98%, after which I get an error:

tests/test_vpts_csv.py::TestVptsCsvV1SupportFun::test_check_source_file PASSED                                                                                            [ 98%]
/Users/peter_desmet/Coding/Repositories/enram/vptstools/.tox/py39/lib/python3.9/site-packages/coverage/data.py:166: CoverageWarning: Couldn't use data file '/Users/peter_desmet/Coding/Repositories/enram/vptstools/.coverage.Peters-MacBook-Air.local.50554.789008-journal': file is not a database
  data._warn(str(exc))
tests/test_vpts_csv.py::TestVptsCsvV1SupportFun::test_check_source_file_wrong_file PASSED                                                                                 [100%]
INTERNALERROR> Traceback (most recent call last):
INTERNALERROR>   File "/Users/peter_desmet/Coding/Repositories/enram/vptstools/.tox/py39/lib/python3.9/site-packages/_pytest/main.py", line 270, in wrap_session
INTERNALERROR>     session.exitstatus = doit(config, session) or 0
INTERNALERROR>   File "/Users/peter_desmet/Coding/Repositories/enram/vptstools/.tox/py39/lib/python3.9/site-packages/_pytest/main.py", line 324, in _main
INTERNALERROR>     config.hook.pytest_runtestloop(session=session)
INTERNALERROR>   File "/Users/peter_desmet/Coding/Repositories/enram/vptstools/.tox/py39/lib/python3.9/site-packages/pluggy/_hooks.py", line 265, in __call__
INTERNALERROR>     return self._hookexec(self.name, self.get_hookimpls(), kwargs, firstresult)
INTERNALERROR>   File "/Users/peter_desmet/Coding/Repositories/enram/vptstools/.tox/py39/lib/python3.9/site-packages/pluggy/_manager.py", line 80, in _hookexec
INTERNALERROR>     return self._inner_hookexec(hook_name, methods, kwargs, firstresult)
INTERNALERROR>   File "/Users/peter_desmet/Coding/Repositories/enram/vptstools/.tox/py39/lib/python3.9/site-packages/pluggy/_callers.py", line 55, in _multicall
INTERNALERROR>     gen.send(outcome)
INTERNALERROR>   File "/Users/peter_desmet/Coding/Repositories/enram/vptstools/.tox/py39/lib/python3.9/site-packages/pytest_cov/plugin.py", line 297, in pytest_runtestloop
INTERNALERROR>     self.cov_controller.finish()
INTERNALERROR>   File "/Users/peter_desmet/Coding/Repositories/enram/vptstools/.tox/py39/lib/python3.9/site-packages/pytest_cov/engine.py", line 44, in ensure_topdir_wrapper
INTERNALERROR>     return meth(self, *args, **kwargs)
INTERNALERROR>   File "/Users/peter_desmet/Coding/Repositories/enram/vptstools/.tox/py39/lib/python3.9/site-packages/pytest_cov/engine.py", line 242, in finish
INTERNALERROR>     self.cov.stop()
INTERNALERROR>   File "/Users/peter_desmet/Coding/Repositories/enram/vptstools/.tox/py39/lib/python3.9/site-packages/coverage/control.py", line 807, in combine
INTERNALERROR>     combine_parallel_data(
INTERNALERROR>   File "/Users/peter_desmet/Coding/Repositories/enram/vptstools/.tox/py39/lib/python3.9/site-packages/coverage/data.py", line 148, in combine_parallel_data
INTERNALERROR>     with open(f, "rb") as fobj:
INTERNALERROR> FileNotFoundError: [Errno 2] No such file or directory: '/Users/peter_desmet/Coding/Repositories/enram/vptstools/.coverage.Peters-MacBook-Air.local.50618.245710-journal'

Comparison of hdf5/daily/monthly files

HDF5

library(dplyr)
library(readr)
library(bioRad)
files <- list.files("~/Downloads/bejab/aloft/hdf5", full.names = TRUE)
vp <- bioRad::read_vpfiles(files)
vpts <-
  bioRad::bind_into_vpts(vp) %>%
  as.data.frame(geo = TRUE, suntime = FALSE) %>%
  dplyr::arrange(datetime, height)
readr::write_csv(vpts, "vpts.csv") # This converts NaN to NA, I have manually set those back to NaN below
nrow(vpts)
# 35575
radar datetime ff dbz dens u v gap w n_dbz dd n DBZH height n_dbz_all eta sd_vvp n_all lat lon height_antenna
bejab 2023-02-02T00:00:00Z 0 NaN -2.578521966934204 18.094850540161133 NaN NaN 1 NaN 2237 NaN 227 25.863487243652344 8821 199.04335021972656 2.0527188777923584 436 51.191700000000004 3.0642000000000005 50
bejab 2023-02-02T00:00:00Z 0 NaN -3.3399300575256348 15.184979438781738 NaN NaN 1 NaN 1568 NaN 163 24.80197525024414 8828 167.03477478027344 2.5198092460632324 456 51.191700000000004 3.0642000000000005 50
bejab 2023-02-02T00:00:00Z 0 NaN -3.988534688949585 13.078372955322266 NaN NaN 1 NaN 2122 NaN 223 23.728849411010742 8862 143.8621063232422 2.4304895401000977 444 51.191700000000004 3.0642000000000005 50
bejab 2023-02-02T00:00:00Z 200 3.0408787727355957 -7.179384708404541 6.272905349731445 2.805347204208374 -1.1734440326690674 0 -21.234346389770508 19771 112.69898986816406 740 16.758867263793945 22850 69.00196075439453 2.9160335063934326 873 51.191700000000004 3.0642000000000005 50
bejab 2023-02-02T00:00:00Z 200 3.2996182441711426 -7.66164493560791 5.613616466522217 2.8031976222991943 -1.740564227104187 0 5.670511245727539 18076 121.8370132446289 568 18.90127182006836 22860 61.749778747558594 2.4480631351470947 866 51.191700000000004 3.0642000000000005 50

Daily

d1 <- readr::read_csv("~/Downloads/aloft/bejab/bejab_vpts_20230202.csv")
d2 <- readr::read_csv("~/Downloads/aloft/bejab/bejab_vpts_20230203.csv")
d3 <- readr::read_csv("~/Downloads/aloft/bejab/bejab_vpts_20230213.csv")
d4 <- readr::read_csv("~/Downloads/aloft/bejab/bejab_vpts_20230214.csv")
d5 <- readr::read_csv("~/Downloads/aloft/bejab/bejab_vpts_20230215.csv")
daily <- bind_rows(d1, d2, d3, d4, d5)
radar datetime height u v w ff dd sd_vvp gap eta dens dbz dbz_all n n_dbz n_all n_dbz_all rcs sd_vvp_threshold vcp radar_latitude radar_longitude radar_height radar_wavelength source_file
bejab 2023-02-02T00:00:00Z 0 NaN NaN NaN NaN NaN 2.0527188777923584 TRUE 199.04335021972656 18.094850540161133 -2.578521966934204 25.863487243652344 227 2237 436 8821 11.0 2.0 51.1917 3.0642 50 5.3 bejab_vp_20230202T000000Z_0x9.h5
bejab 2023-02-02T00:00:00Z 0 NaN NaN NaN NaN NaN 2.5198092460632324 TRUE 167.03477478027344 15.184979438781738 -3.3399300575256348 24.80197525024414 163 1568 456 8828 11.0 2.0 51.1917 3.0642 50 5.3 bejab_vp_20230202T000500Z_0x9.h5
bejab 2023-02-02T00:00:00Z 0 NaN NaN NaN NaN NaN 2.4304895401000977 TRUE 143.8621063232422 13.078372955322266 -3.988534688949585 23.728849411010742 223 2122 444 8862 11.0 2.0 51.1917 3.0642 50 5.3 bejab_vp_20230202T001000Z_0x9.h5
bejab 2023-02-02T00:00:00Z 200 2.805347204208374 -1.1734440326690674 -21.234346389770508 3.0408787727355957 112.69898986816406 2.9160335063934326 FALSE 69.00196075439453 6.272905349731445 -7.179384708404541 16.758867263793945 740 19771 873 22850 11.0 2.0 51.1917 3.0642 50 5.3 bejab_vp_20230202T000000Z_0x9.h5
bejab 2023-02-02T00:00:00Z 200 2.8031976222991943 -1.740564227104187 5.670511245727539 3.2996182441711426 121.8370132446289 2.4480631351470947 FALSE 61.749778747558594 5.613616466522217 -7.66164493560791 18.90127182006836 568 18076 866 22860 11.0 2.0 51.1917 3.0642 50 5.3 bejab_vp_20230202T000500Z_0x9.h5

Monthly

monthly <- readr::read_csv("~/Downloads/aloft/bejab_vpts_202302.csv.gz")
radar datetime height u v w ff dd sd_vvp gap eta dens dbz dbz_all n n_dbz n_all n_dbz_all rcs sd_vvp_threshold vcp radar_latitude radar_longitude radar_height radar_wavelength source_file
bejab 2023-02-02T00:00:00Z 0 2.0527188777923584 TRUE 199.04335021972656 18.094850540161133 -2.578521966934204 25.863487243652344 227 2237 436 8821 11.0 2.0 51.1917 3.0642 50 5.3 bejab_vp_20230202T000000Z_0x9.h5
bejab 2023-02-02T00:00:00Z 0 2.5198092460632324 TRUE 167.03477478027344 15.184979438781738 -3.3399300575256348 24.80197525024414 163 1568 456 8828 11.0 2.0 51.1917 3.0642 50 5.3 bejab_vp_20230202T000500Z_0x9.h5
bejab 2023-02-02T00:00:00Z 0 2.4304895401000977 TRUE 143.8621063232422 13.078372955322266 -3.988534688949585 23.728849411010742 223 2122 444 8862 11.0 2.0 51.1917 3.0642 50 5.3 bejab_vp_20230202T001000Z_0x9.h5
bejab 2023-02-02T00:00:00Z 200 2.805347204208374 -1.1734440326690674 -21.234346389770508 3.0408787727355957 112.69898986816406 2.9160335063934326 FALSE 69.00196075439453 6.272905349731445 -7.179384708404541 16.758867263793945 740 19771 873 22850 11.0 2.0 51.1917 3.0642 50 5.3 bejab_vp_20230202T000000Z_0x9.h5
bejab 2023-02-02T00:00:00Z 200 2.8031976222991943 -1.740564227104187 5.670511245727539 3.2996182441711426 121.8370132446289 2.4480631351470947 FALSE 61.749778747558594 5.613616466522217 -7.66164493560791 18.90127182006836 568 18076 866 22860 11.0 2.0 51.1917 3.0642 50 5.3 bejab_vp_20230202T000500Z_0x9.h5
testthat::expect_equal(daily, monthly)

Refactor s3fs (and the aotbotocore) out as a dependency

s3fs is a convenient package to interact with S3 (it feels like coding against a normal file system), but since s3fs relies on the async aiobotocore package, some issues arose:

  • due to very strict dependency pins in both boto3 and aiobotocore, the pip solver was not able to resolve this unless boto3 was defined as an additional install from s3fs (see daeb285 for the fix)
  • aiobotocore requires raw_headers, which are not according to spec in the moto library used in the testing (see getmoto/moto#3259 for a description and 802b9f5 for the fix)
  • aiobotocore/s3fs get stuck in a recursive loop copying the files in the pytest context, leading to an ever-increasing file size when running the tests. This has been solved by relying on boto3 instead for the download, see 28a5e42

Hence, these issues have been handled in https://github.com/enram/vptstools/tree/SVH-country-filter, but I'm not sure the convenience of s3fs is worth having it as a dependency. Excluding s3fs might make things easier to maintain.
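
For reference, a minimal sketch of a plain boto3 download replacing the s3fs call (not the exact change in 28a5e42; the bucket and key are examples taken from this repository):

import boto3

s3 = boto3.client("s3")
# Download a single h5 file from the aloft bucket to a local path;
# this avoids pulling in s3fs/aiobotocore for the download step.
s3.download_file(
    "aloft",
    "baltrad/hdf5/bejab/2023/02/02/bejab_vp_20230202T000000Z_0x9.h5",
    "bejab_vp_20230202T000000Z_0x9.h5",
)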

Provide a full rerun functionality

We need to be able to remove all CSV files (daily/monthly) in the s3 aloft bucket and replace them with a new run using a given schema-version implementation.

This needs to be operational so a rerun can happen after enram/vpts-csv#42 is merged.

Investigate MyPy issues

I'd like to (sooner than later) use MyPy to make this package more robust.

A first try gives me error: Skipping analyzing "odimh5.reader": found module but no type hints or library stubs, which seems weird since the odimh5 package is type-annotated.

To investigate.

Would be useful if coverage.csv also provided info on daily/monthly

The coverage.csv only provides coverage for the hdf5 portion of each source (baltrad, uva, ecog). It would be useful if it also provided the coverage for the daily and monthly portions of each source. That is not a trivial change, however, since the coverage is based on the AWS inventory and that inventory is limited to the hdf5 files (because that is the part that matters for vph5_to_vpts).

Where to drop csv files?

The directory consensus for files (enram/data-repository#65 (comment)) is source/format/radar/yyyy/

I suggest:

# source data
baltrad/hdf5/radar/yyyy/mm/dd/file.h5

# daily unzipped csv
baltrad/daily/radar/yyyy/file.csv

# monthly gzipped csv
baltrad/monthly/radar/yyyy/file.csv.gz

Investigate `vcp` error

Logs for the entire bucket indicate:

September 06, 2023 at 12:34 (UTC+2:00)[WARNING] - During conversion from HDF5 files of baltrad/bejab at 2018-06-03 to daily VPTS file, the following error occurred: 'vcp'.
sync
September 06, 2023 at 12:33 (UTC+2:00)Create daily VPTS file baltrad/daily/bejab/2018/bejab_vpts_20180603.csv.
sync

What is the vcp error?

Cannot reproduce second README example

Using these files, which I saved in data:

https://aloftdata.s3-eu-west-1.amazonaws.com/baltrad/hdf5/nldbl/2013/11/23/nldbl_vp_20131123T0000Z.h5
https://aloftdata.s3-eu-west-1.amazonaws.com/baltrad/hdf5/nldbl/2013/11/23/nldbl_vp_20131123T0015Z.h5
https://aloftdata.s3-eu-west-1.amazonaws.com/baltrad/hdf5/nldbl/2013/11/23/nldbl_vp_20131123T0030Z.h5

I get an error when I try to reproduce the second README example:

from pathlib import Path
from vptstools.vpts import vpts

file_paths = sorted(Path("./data").rglob("*.h5"))  # Get all h5 files within the data directory
df_vpts = vpts(file_paths)
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/multiprocessing/spawn.py", line 120, in spawn_main
    exitcode = _main(fd, parent_sentinel)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/multiprocessing/spawn.py", line 129, in _main
    prepare(preparation_data)
  File "/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/multiprocessing/spawn.py", line 240, in prepare
    _fixup_main_from_path(data['init_main_from_path'])
  File "/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/multiprocessing/spawn.py", line 291, in _fixup_main_from_path
    main_content = runpy.run_path(main_path,
                   ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "<frozen runpy>", line 291, in run_path
  File "<frozen runpy>", line 98, in _run_module_code
  File "<frozen runpy>", line 88, in _run_code
  File "/Users/peter_desmet/Coding/Repositories/enram/vptstools/test.py", line 5, in <module>
    df_vpts = vpts(file_paths)
              ^^^^^^^^^^^^^^^^
  File "/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/vptstools/vpts.py", line 256, in vpts
    with multiprocessing.Pool(processes=cpu_count) as pool:
         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/multiprocessing/context.py", line 119, in Pool
    return Pool(processes, initializer, initargs, maxtasksperchild,
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/multiprocessing/pool.py", line 215, in __init__
    self._repopulate_pool()
  File "/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/multiprocessing/pool.py", line 306, in _repopulate_pool
    return self._repopulate_pool_static(self._ctx, self.Process,
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

If I print file_paths I get:

[PosixPath('data/nldbl_vp_20131123T0000Z.h5'), PosixPath('data/nldbl_vp_20131123T0015Z.h5'), PosixPath('data/nldbl_vp_20131123T0030Z.h5')]
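
The traceback goes through multiprocessing/spawn.py, which suggests the macOS spawn start method is re-importing the script when vpts() opens its process pool. Guarding the top-level call might avoid it (a sketch of the workaround, not a confirmed fix):

from pathlib import Path
from vptstools.vpts import vpts

if __name__ == "__main__":
    # On macOS the spawn start method re-imports this module in each worker,
    # so the top-level call must be guarded.
    file_paths = sorted(Path("./data").rglob("*.h5"))
    df_vpts = vpts(file_paths)
    print(df_vpts)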

Datetime in h5 filename does not correspond to h5 content (/what/time, "HHmmss")

I downloaded a set of files from the bejab data as a test case and, while trying out the CSV concatenation (to create a vpts-csv), I encountered repeated timestamps for multiple files, not corresponding to the timestamp included in the file name:

FILENAME                         /what/time               
bejab_vp_20221111T233000Z_0x9.h5 233000
bejab_vp_20221111T234000Z_0x9.h5 233000
bejab_vp_20221111T234500Z_0x9.h5 234500
bejab_vp_20221111T235000Z_0x9.h5 234500
bejab_vp_20221111T235500Z_0x9.h5 234500

To check, I downloaded some files directly from the Baltrad sftp and compared the timestamp in the file name with the timestamp in /what/time, which revealed several of these differences (67% in a quick test on 50 files):

FILE    WHAT/TIME   FILEPATH
2250    2245        bejab_vp_20221112T225000Z_0x9.h5
0235    0230        bewid_vp_20221113T023500Z_0xb.h5
1635    1630        chppm_vp_20221114T163500Z_0xb.h5
0310    0300        dedrs_vp_20221115T031000Z_0xb.h5
0105    0100        defbg_vp_20221114T010500Z_0xb.h5
1025    1015        deisn_vp_20221115T102500Z_0xb.h5
0125    0115        denhb_vp_20221114T012500Z_0xb.h5
0505    0500        denhb_vp_20221114T050500Z_0xb.h5
1210    1200        eehar_vp_20221113T121000Z_0xb.h5
0410    0400        eehar_vp_20221114T041000Z_0xb.h5
0520    0515        esalm_vp_20221114T052000Z_0xb.h5
1410    1400        esbar_vp_20221113T141000Z_0xb.h5
1420    1415        essse_vp_20221114T142000Z_0xb.h5
1040    1030        esval_vp_20221115T104000Z_0xb.h5
0150    0145        filuo_vp_20221114T015000Z_0xb.h5
0255    0245        finur_vp_20221114T025500Z_0xb.h5
1440    1430        frabb_vp_20221114T144000Z_0xb.h5
0050    0045        frcol_vp_20221115T005000Z_0xb.h5
1835    1830        frmcl_vp_20221114T183500Z_0xb.h5
1340    1330        frmom_vp_20221114T134000Z_0xb.h5
2050    2045        frnim_vp_20221113T205000Z_0xb.h5
0320    0315        frniz_vp_20221113T032000Z_0xb.h5
0640    0630        frtou_vp_20221113T064000Z_0xb.h5
2250    2245        frtra_vp_20221114T225000Z_0xb.h5
0825    0815        frtre_vp_20221113T082500Z_0xb.h5
0605    0600        nohgb_vp_20221115T060500Z_0xb.h5
1555    1545        nosmn_vp_20221113T155500Z_0xb.h5
0020    0015        plram_vp_20221114T002000Z_0xb.h5
2205    2200        sekaa_vp_20221113T220500Z_0xb.h5
0210    0200        sevax_vp_20221113T021000Z_0xb.h5

@peterdesmet is this a known issue or am I stuck on a bug I just can't get around? For the latter experiment I relied only on the h5py package as a dependency (I left out the vptstools modules and just tried to extract the timestamps):

from pathlib import Path

import h5py

file_paths = sorted(Path("../data/raw/baltrad/").rglob("*.h5"))

for path_h5 in file_paths:
    with h5py.File(path_h5, mode="r") as odim_vp:
        # HHmm from the file name, e.g. bejab_vp_20221112T225000Z_0x9.h5 -> "2250"
        time_filename = path_h5.stem.split("_")[2][9:13]
        # HHmm from the /what/time attribute (stored as "HHmmss", seconds dropped)
        time_h5_what = odim_vp["what"].attrs.get("time").decode("utf-8")[:-2]
        if time_filename != time_h5_what:
            print(time_filename, time_h5_what, path_h5)

The time difference might not be an issue if the timestamps are unique among the different files. Or should we rather use the timestamp from the file path of the h5 files?

Run with `--modified-days-ago 0` failed silently

@TheJenne18 ran a full processing with --modified-days-ago 0, using 8 vCPUs and 16 GB. The job stopped silently, with no data added to the bucket. The only line in the log file is:

Recreate the full set of bucket files (files modified since 401days). This will take a while!

I assume that is just the start of the process. Did it fail reading the full inventory into memory? It might be useful to provide more messages, so we know at what point the processing failed.

How to handle the gain/offset in the conversion from hdf5 to vpts-csv

The https://github.com/adokter/vol2bird/wiki/ODIM-bird-profile-format-specification#specification-of-bird-profile-output-in-odim-hdf5-format specification defines a gain and an offset for the datasets/variables. The current implementation of the conversion from h5 to VPTS CSV does not take these into account. @adokter, should this actually be done by default, storing quantity*gain+offset for each record in the vpts-csv instead of the raw quantity?
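
For illustration, a minimal h5py sketch of applying gain/offset following the ODIM convention value = raw * gain + offset (the group layout dataset1/data1 is an assumption; this is not the vptstools implementation):

import h5py

def read_scaled(path_h5, group="dataset1/data1"):
    """Return the physical values of one profile quantity, applying gain/offset."""
    with h5py.File(path_h5, mode="r") as odim_vp:
        raw = odim_vp[f"{group}/data"][()]
        attrs = odim_vp[f"{group}/what"].attrs
        gain = attrs.get("gain", 1.0)
        offset = attrs.get("offset", 0.0)
        # ODIM convention: physical value = raw * gain + offset
        return raw * gain + offset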

Incorrect author in PyPi

https://github.com/enram/vptstools/blob/main/setup.cfg has been updated to list INBO as the author. When I install from PyPI, I still see the old information:

pip3 show vptstools
Name: vptstools
Version: 0.2.2
Summary: Tools to work with vertical profile time series.
Home-page: https://enram.github.io/vptstools/
Author: enram
Author-email: 
License: MIT
Location: /Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages
Requires: click, frictionless, h5py, pandas, pytz
Required-by: 

How can an update be forced?

Manage tags/releases

I notice we have many tags and 1 release. @stijnvanhoey:

  1. Is a tag sufficient to have it picked up by GitHub Actions and used in the operational pipeline? Or does it require a release?
  2. Is a tag sufficient to have it published to PyPI? Yes, see https://pypi.org/project/vptstools/
  3. Should we create releases for the most recent tags? Maybe only for the minor versions (not hotfixes).
  4. Should we clean up the tags? There seems to be a mix of usage. Here's my suggestion:
  • v0.2.2: keep
  • v0.2.1: keep
  • v0.2.0: keep
  • 0.1.0a14: delete
  • 0.1.0a13: delete
  • v0.1.0: keep, is release
  • 0.1.0a12: delete
  • 0.1.0a11: delete
  • 0.1.0a10: delete
  • 0.1.0a9: delete
  • 0.1.0a8: delete
  • 0.1.0a7: delete
  • 0.1.0a6: delete
  • 0.1.0a5: delete
  • 0.1.0a4: delete
  • 0.1.0a3: delete
  • 0.1.0a2: delete
  • 0.1.0a1: delete

Add 'source' column

See enram/vpts-csv#42; implementation-wise:

  • The data record should be the full s3 path: s3://aloft/baltrad...
  • Added as last column in the mapping
  • Added to the sorting: dict(radar=str, datetime=str, height=int, source_file=str) (see the sketch below)
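
For illustration, a rough pandas sketch of that sorting step (the dataframe content and s3 paths are illustrative examples; column names follow the dict above):

import pandas as pd

df = pd.DataFrame({
    "radar": ["bejab", "bejab"],
    "datetime": ["2023-02-02T00:05:00Z", "2023-02-02T00:00:00Z"],
    "height": [0, 0],
    "source_file": [
        "s3://aloft/baltrad/hdf5/bejab/2023/02/02/bejab_vp_20230202T000500Z_0x9.h5",
        "s3://aloft/baltrad/hdf5/bejab/2023/02/02/bejab_vp_20230202T000000Z_0x9.h5",
    ],
})
# Cast to the sort types from the dict above, then sort with source_file as the last key.
df = df.astype({"radar": str, "datetime": str, "height": int, "source_file": str})
df = df.sort_values(by=["radar", "datetime", "height", "source_file"], ignore_index=True)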

Flow for converting h5 to VPTS CSV files

Pseudo code:

import pandas as pd

h5_files = get_h5_files(radar, start, end, source)  # returns a list of file paths
dfs = [h5_to_df(h5_file) for h5_file in h5_files]   # one dataframe per h5 file
df = pd.concat(dfs, ignore_index=True)
df.to_csv("some/path/name.csv", index=False)

So:

  1. a custom function get_h5_files() that understands the directory structure of the repo. It likely uses the s3 library under the hood to get a list of file paths that match the radar, start date, end date and source criteria (see the sketch after this list).

  2. a custom function h5_to_df() that reads an h5 file and converts it to the VPTS CSV format, but as a dataframe, not a file. The function can be called many times to build a growing data frame.

  3. a generic write_csv() function (e.g. from pandas) that writes the df to a file at some location. The write_csv() settings should match those of the CSV dialect defined for VPTS CSV.
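
A hypothetical sketch of get_h5_files(), assuming boto3 for the listing and YYYYMMDD strings for start/end (the key layout follows the source/hdf5/radar/yyyy/mm/dd/ convention used in the bucket; the bucket name is an example):

import boto3

def get_h5_files(radar, start, end, source, bucket="aloft"):
    """List s3 keys of h5 files for one radar/source between two YYYYMMDD dates."""
    s3 = boto3.client("s3")
    paginator = s3.get_paginator("list_objects_v2")
    keys = []
    for page in paginator.paginate(Bucket=bucket, Prefix=f"{source}/hdf5/{radar}/"):
        for obj in page.get("Contents", []):
            key = obj["Key"]
            # Keys look like source/hdf5/radar/yyyy/mm/dd/<file>.h5
            date = "".join(key.split("/")[3:6])
            if key.endswith(".h5") and start <= date <= end:
                keys.append(key)
    return keys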

Update the CI setup

Points to cover:

  • check the Python versions for which we want the CI to run (which server is it deployed on?)
  • pip-tools integration to provide a deployable version with a fixed dependency set (+ CI logic)
  • deployment of the Sphinx website and to PyPI (key setup, ...)
  • CI logic for the lint/formatting checks so that they raise errors on messages

AWS error references deleted file

I got an AWS notification email today:

CLI routine 'vph5_to_vpts --modified-days-ago 2' failed raising error: '<class 'ValueError'>: File name uva/hdf5/dbl/2008/02/17/nldbl_vp_20080217t0000_nl50_v0-3-20.h5 is not a valid ODIM h5 file.'.

It now includes the name of the file 👍. The mentioned file is however no longer in the repository. It was deleted August 21 or 22 (2 or 1 days ago). I'll see tomorrow if the issue resolves itself, i.e. the inventory is updated and the file is no longer listed there and no notification is generated.

@stijnvanhoey @TheJenne18 can deleted files linger in the inventory? Should this resolve itself automatically? Not sure we considered this when designing the architecture.

Note this error did not stop the creation of the daily and monthly files 👍

Provide environment in exception messages

Suggestion by @TheJenne18: to avoid confusion about which environment notifications are sent from (#62), it might be useful to include the environment (os.environ["ENV"]) in error messages such as:

CLI routine 'vph5_to_vpts --modified-days-ago 2' failed raising error: '<class 'ValueError'>: File name uva/hdf5/dbl/2008/02/17/nldbl_vp_20080217t0000_nl50_v0-3-20.h5 is not a valid ODIM HDF5 file.'.

Sample data for unit testing

@niconoe when running the current unit tests, there is a reference to sample data which is not available in the repository. Is there a reference or documentation available on the example data setup? Should I just use 'any' h5 file to make the test_error_non_vp_source_file test work (as this one should fail)?

Add reproducible example in README

Add a section Usage with a simple reproducible example (Python or command-line code) to show how three h5 files (downloaded from aloft) can be converted to VPTS CSV. Tackle in #50.

Improve upload handling

In order to speed up uploads to S3, handling multiple files at the same time would be a huge improvement.
A first option would be working with async, but as the boto3 library does not yet support async handling, this approach will not work. Working with multiple threads or parallel processes would be a valid option to implement.
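
A minimal sketch of the threaded option with boto3 and concurrent.futures (the bucket name and the (path, key) input format are assumptions, not the current upload code):

from concurrent.futures import ThreadPoolExecutor

import boto3

s3 = boto3.client("s3")

def upload_one(path, key, bucket="aloft"):
    # boto3 clients are thread-safe, so one client can be shared across workers.
    s3.upload_file(str(path), bucket, key)

def upload_many(pairs, max_workers=8):
    """pairs: iterable of (local_path, s3_key) tuples."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [pool.submit(upload_one, path, key) for path, key in pairs]
        for future in futures:
            future.result()  # re-raise any upload error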

Validating generated csv files

@stijnvanhoey CSV files can be validated with:

  1. Create a datapackage.json file with the following content:
{
  "profile": "tabular-data-package",
  "resources": [
    {
      "name": "vpts",
      "path": "vpts.csv",
      "profile": "tabular-data-resource",
      "format": "csv",
      "mediatype": "text/csv",
      "encoding": "utf-8",
      "schema": "https://raw.githubusercontent.com/enram/vpts-csv/main/vpts-csv-table-schema.json"
    }
  ]
}
  2. Place the datapackage.json file in the same directory as your CSV file. Rename the path value if necessary to point to the CSV file (named vpts.csv above).
  3. Install https://github.com/frictionlessdata/frictionless-py (I'm using v4.40.0).
  4. Run in the CLI: frictionless validate datapackage.json
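
For reference, the same validation can also be run from Python with frictionless-py v4 (a sketch matching the CLI call above):

from frictionless import validate

# Validate the data package (and thus the CSV) against the VPTS CSV table schema.
report = validate("datapackage.json")
print(report.valid)
print(report.flatten(["rowPosition", "fieldPosition", "code", "message"]))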

Move and adjust the s3 inventory

  • create a new bucket "aloft-inventory" and move the inventory there
  • use the CSV inventory (not the parquet one)
  • add functionality to remove old inventory files (from manifest -> remove files)
  • the inventory needs to be applied to all 'sources' (baltrad, ecol, ...)
  • the file name is used as the single source of truth, and the file name <-> s3-path logic is a function that can be passed (injected) into the routine (see the sketch after this list)
  • assign the ec2 engine rights to access the aloft-inventory bucket
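
A hypothetical example of such an injectable file name <-> s3-path function (the function name is illustrative; the path layout follows the source/hdf5/radar/yyyy/mm/dd/ convention used in the bucket):

def h5_path_from_filename(file_name, source="baltrad"):
    """Map e.g. bejab_vp_20230202T000000Z_0x9.h5 to baltrad/hdf5/bejab/2023/02/02/<file_name>."""
    radar = file_name.split("_")[0]
    date = file_name.split("_")[2][:8]  # yyyymmdd part of the timestamp
    return f"{source}/hdf5/{radar}/{date[:4]}/{date[4:6]}/{date[6:8]}/{file_name}"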
