Comments (4)

rebeccaringuette commented on July 4, 2024

I found a solution that works nicely without xarray. It requires adding two packages as dependencies for kamodo-ccmc, but removes two other packages. The two packages needed are s3fs and h5netcdf, both installable via pip. h5py and netCDF4 are no longer needed.

The previous method of accessing nc files with netCDF4.Dataset can be replaced with the method below, which works for files stored in s3 buckets and in normal storage.

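from h5netcdf.legacyapi import Dataset as Dataset_leg
import s3fs

def Dataset(filename, access='r'):
    # Drop-in replacement for netCDF4.Dataset (s3 or local/efs files).
    if filename.startswith('s3://'):
        # s3 objects are opened as binary file objects first
        s3 = s3fs.S3FileSystem(anon=False)
        fgrab = s3.open(filename, access+'b')
        return Dataset_leg(fgrab)
    else:
        # local/efs files must be passed by name, not as a file object
        return Dataset_leg(filename, access)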

Notice that the new definition of 'Dataset' automatically performs the correct operation based on whether the file is stored in an s3 bucket or in normal storage. This should be tested on WACCM-X files because the h0 files are produced with the 'NETCDF3_64BIT_OFFSET' option due to the large file sizes generated. This code can go into the reader_utilities.py script and be imported from there, so that only the import statements in the affected readers need to be changed.
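Since h5netcdf is built on h5py, the NETCDF3 case is the one worth checking before handing a file to the new 'Dataset'. A minimal sketch of how the on-disk format could be identified from the leading bytes of a file object is below. The helper name sniff_format and the returned labels are hypothetical, and the check assumes the HDF5 signature sits at offset 0 (no user block).

```python
import io

# Leading bytes ("magic numbers") that identify each on-disk format.
HDF5_MAGIC = b"\x89HDF\r\n\x1a\n"  # netCDF-4 files are HDF5 files
NETCDF3_MAGIC = {
    b"CDF\x01": "NETCDF3_CLASSIC",
    b"CDF\x02": "NETCDF3_64BIT_OFFSET",
    b"CDF\x05": "NETCDF3_64BIT_DATA",
}

def sniff_format(fileobj):
    """Return a format label for a binary file-like object (hypothetical helper)."""
    head = fileobj.read(8)
    fileobj.seek(0)  # rewind so the object can still be passed to a reader
    if head == HDF5_MAGIC:
        return "HDF5/NETCDF4"
    return NETCDF3_MAGIC.get(head[:4], "UNKNOWN")

# BytesIO stands in for the first bytes of a 64-bit-offset netCDF3 file.
fake_h0 = io.BytesIO(b"CDF\x02" + b"\x00" * 28)
print(sniff_format(fake_h0))  # NETCDF3_64BIT_OFFSET
```

The same function accepts the object returned by s3.open(filename, 'rb') or by open(filename, 'rb'), so one check would cover both storage types.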

The normal file search method using glob will need to be replaced by the code below, which also automatically performs the correct operation based on the filename. This code would go nicely in the reader_utilities.py script, from which glob should be imported for all uses. Then, only the import statement in the readers will need to be changed.

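from glob import glob as glob_leg
import s3fs

def glob(file_pattern):
    # Drop-in replacement for glob.glob (s3 or local/efs patterns).
    if file_pattern.startswith('s3://'):
        s3 = s3fs.S3FileSystem(anon=False)
        s3_files = sorted(s3.glob(file_pattern))
        # s3.glob returns paths without the protocol, so add it back
        return ['s3://'+f for f in s3_files]
    else:
        return glob_leg(file_pattern)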
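One detail to keep in mind is that glob.glob returns files in arbitrary order, while the s3 branch above sorts its results. A minimal sketch of a variant that sorts in both branches is below. The names is_s3_path and glob_sorted are hypothetical, and s3fs is imported lazily only so the local branch can run without it.

```python
import os
import tempfile
from glob import glob as glob_leg

def is_s3_path(path):
    """True for paths that carry the s3 protocol prefix."""
    return path.startswith("s3://")

def glob_sorted(file_pattern):
    """Return matching files in sorted order for s3 or local/efs patterns."""
    if is_s3_path(file_pattern):
        import s3fs  # lazy import: only needed for s3 patterns
        s3 = s3fs.S3FileSystem(anon=False)
        return ["s3://" + f for f in sorted(s3.glob(file_pattern))]
    return sorted(glob_leg(file_pattern))

# Local demonstration with throwaway files.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("b.nc", "a.nc", "c.txt"):
        open(os.path.join(tmp, name), "w").close()
    found = [os.path.basename(f) for f in glob_sorted(os.path.join(tmp, "*.nc"))]
    print(found)  # ['a.nc', 'b.nc']
```

Sorting in both branches would make any code that uses files[0] behave the same way on local/efs and s3 storage.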

The code to replace calls to h5py is

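import h5netcdf as h5py  # works for s3 and efs
import s3fs

def convert(filename, access='r'):
    # Build the argument list for h5py.File (s3 or local/efs files).
    if filename.startswith('s3://'):
        s3 = s3fs.S3FileSystem(anon=False)
        fgrab = s3.open(filename, access+'b')
        return [fgrab]  # file object only; the mode defaults to 'r'
    else:
        return [filename, access]  # filename and mode

h5_data = h5py.File(*convert(filename))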

where convert should be stored in the reader_utilities.py script. This remains to be tested in the relevant readers. h5netcdf and h5netcdf.legacyapi.Dataset both break for the normal/efs case if the file object is given instead of the filename. Note that this does NOT enable writing netcdf/h5 files to s3, so file conversions on the cloud will not be supported.

Since all of the file formats after file conversions are either .h5, .nc or .out files, this reduces the remaining file access problem to reading the two text files produced by each reader and the general I/O in SF_output.py. The open statements in the read_timelist function in reader_utilities.py should be replaced with a call to the function below, which should offer the same resulting behavior for text files on local/efs or s3 storage. This has not been tested.

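import s3fs

def _open(filename, mode='r'):
    # Open a text file on s3 or local/efs storage.
    if filename.startswith('s3://'):
        s3 = s3fs.S3FileSystem(anon=False)
        # s3.open defaults to 'rb', so pass the mode to match open()
        return s3.open(filename, mode)
    else:
        return open(filename, mode)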

Reading the csv and ascii files from s3 may be as simple as replacing line 149 in SFcsv_reader with a call to the function above. The behavior of this function with the csv package has not been tested. Writing csv and ascii files directly to s3 buckets may be possible with the function, but this has not been tested. A related issue on xarray's GitHub may be useful if others are interested in writing files to s3.
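For the csv case, note that csv.reader expects text lines, while s3fs opens files in binary mode ('rb') unless a text mode is requested. A minimal sketch of wrapping a binary handle so the csv package can read it is below. The helper name as_text and the sample column names are hypothetical, UTF-8 encoding is an assumption, and io.BytesIO stands in for a real s3 handle, so this has not been run against s3.

```python
import csv
import io

def as_text(fileobj, encoding="utf-8"):
    """Wrap a binary file-like object so it yields str lines (hypothetical helper)."""
    if isinstance(fileobj, io.TextIOBase):
        return fileobj  # already text mode, e.g. from the builtin open()
    # newline="" lets the csv module handle line endings itself
    return io.TextIOWrapper(fileobj, encoding=encoding, newline="")

# BytesIO stands in for a handle opened in binary mode.
binary_handle = io.BytesIO(b"time,lon,lat\n0.0,10.5,-45.0\n")
rows = list(csv.reader(as_text(binary_handle)))
print(rows)  # [['time', 'lon', 'lat'], ['0.0', '10.5', '-45.0']]
```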

lrastaet commented on July 4, 2024

Does _open() not work as file argument to IdlFile()?

rebeccaringuette commented on July 4, 2024

No, because spacepy performs the open command. The change has to happen on the spacepy side for the s3 issue.

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Cell In[5], line 1
----> 1 MW.Variable_Search('magnetic', model, file_dir)

File ~/efs/raringuette/Kamodo/kamodo_ccmc/flythrough/model_wrapper.py:279, in Variable_Search(search_string, model, file_dir, return_dict)
    277         return new_dict
    278 elif file_dir != '' and model != '':
--> 279     ko_var_dict = Model_Variables(model, file_dir, return_dict=True)
    280     new_dict = {name: [value[0], value[-4]+'-'+value[-3],
    281                        value[-2], value[-1]] for name, value in
    282                 ko_var_dict.items() if search_string in value[0].lower()}
    283     if new_dict == {}:

File ~/efs/raringuette/Kamodo/kamodo_ccmc/flythrough/model_wrapper.py:184, in Model_Variables(model, file_dir, return_dict)
    182 else:
    183     reader = Model_Reader(model)
--> 184     ko = reader(file_dir, variables_requested='all')
    186     # either return or print nested_dictionary
    187     if return_dict:

File ~/efs/raringuette/Kamodo/kamodo_ccmc/readers/swmfgm_4D.py:128, in MODEL.<locals>.MODEL.__init__(self, file_dir, variables_requested, filetime, verbose, gridded_int, printfiles, **kwargs)
    126 patterns = unique([basename(f)[:10] for f in files])
    127 # get time grid from files
--> 128 dt = sp.IdlFile(RU._open(files[0]),
    129                 sort_unstructured=False).attrs['time']
    130 if dt is not None:  # filedate given not always at midnight
    131     self.filedate = datetime.strptime(
    132         dt.isoformat()[:10], '%Y-%m-%d').replace(
    133         tzinfo=timezone.utc)

File ~/users_conda_envs/PyHCs3/lib/python3.10/site-packages/spacepy/pybats/__init__.py:1220, in IdlFile.__init__(self, filename, iframe, header, keep_case, sort_unstructured, *args, **kwargs)
   1216 super(IdlFile, self).__init__(*args, **kwargs)  # Init as PbData.
   1218 # Gather information about the file: format, endianess (if necessary),
   1219 # number of picts/frames, etc.:
-> 1220 fmt, endchar, inttype, floattype = _probe_idlfile(filename)
   1221 self.attrs['file'] = filename   # Save file name.
   1222 self.attrs['format'] = fmt        # Save file format.

File ~/users_conda_envs/PyHCs3/lib/python3.10/site-packages/spacepy/pybats/__init__.py:807, in _probe_idlfile(filename)
    804 inttype = np.dtype(np.int32)
    805 floattype = np.dtype(np.float32)
--> 807 with open(filename, 'rb') as f:
    808     # On the first try, we may fail because of wrong-endianess.
    809     # If that is the case, swap that endian and try again.
    810     inttype.newbyteorder(endian)
    812     try:
    813         # Try to parse with little endian byte ordering:

TypeError: expected str, bytes or os.PathLike object, not TextIOWrapper
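The traceback shows that _probe_idlfile calls open(filename, 'rb') itself, so IdlFile needs a path rather than a file object. One possible workaround, which is an assumption and not something tested in this thread, is to copy the remote object to a local temporary file and pass that path instead. A minimal sketch is below. The helper name to_local_copy and the '.out' suffix default are hypothetical, and io.BytesIO stands in for the s3 handle.

```python
import io
import os
import shutil
import tempfile

def to_local_copy(fileobj, suffix=".out"):
    """Copy a binary file-like object to a temporary file and return its path."""
    with tempfile.NamedTemporaryFile(suffix=suffix, delete=False) as tmp:
        shutil.copyfileobj(fileobj, tmp)
        return tmp.name

# BytesIO stands in for s3.open(files[0], 'rb'); untested against real s3.
remote = io.BytesIO(b"\x00\x01binary payload")
local_path = to_local_copy(remote)
print(os.path.getsize(local_path))  # 16
os.remove(local_path)  # the caller is responsible for cleanup
```

The cost is a full download of each file plus local disk space, so this would only be a stopgap until the open call is handled on the spacepy side.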

rebeccaringuette commented on July 4, 2024

This issue is solved in pull request #131, both for netCDF4 and netCDF3 files (and for h5 files, too), with the exceptions noted in this issue.
