carpentries-lab / python-aos-lesson
Python for Atmosphere and Ocean Scientists
Home Page: https://carpentries-lab.github.io/python-aos-lesson/
License: Other
AUTHORS should include @DamienIrving to synchronize with JOSE submission.
In lesson 07-vectorisation, the first call to xarray.open_dataset raises an error because the xarray module is imported as xr. Change xarray.open_dataset to xr.open_dataset?
A number of improvements could be made by switching from a data analysis problem focused on precipitation data to one focused on an ocean variable: the cmocean colormap library could be used, and the vectorisation lesson could use a basin file rather than an sftlf file.
When working on the second episode, when I get to changing the units for the precip data, I get the following error:
AttributeError: cannot set attribute 'units' on a 'DataArray' object. Use setitem style assignment (e.g., ds['name'] = ...) instead to assign variables.
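As the error message hints, recent xarray versions disallow attribute-style assignment of metadata; going through the .attrs dictionary avoids it. A minimal sketch (the variable names and values are illustrative, not the lesson's exact code):

```python
import numpy as np
import xarray as xr

# Stand-in for the lesson's precipitation DataArray (hypothetical values)
pr = xr.DataArray(np.random.rand(3, 4), dims=["lat", "lon"], name="pr")
pr.attrs["units"] = "kg m-2 s-1"

# Convert kg m-2 s-1 to mm/day, then update the units metadata.
# Assigning via .attrs avoids the AttributeError raised by `pr.units = ...`
pr_mmday = pr * 86400
pr_mmday.attrs["units"] = "mm/day"
```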
The Carpentries Lesson Template includes a section on Checking and Previewing. There are scripts for that which are not included in the python-aos-lesson.
I would like to try using those scripts to render locally before I push changes to my repo. But it might not be needed if most development and editing is done directly in GitHub. If I use the scripts, I will PR a branch with them.
Hi @DamienIrving,
Towards the end of the large data episode, https://carpentries-lab.github.io/python-aos-lesson/10-large-data/index.html, pr_max.compute() is invoked to compute the max precipitation values across all times.
This pr_max.compute() call should:
--or--
From the xarray doc: https://xarray.pydata.org/en/stable/generated/xarray.DataArray.compute.html
Manually trigger loading of this array’s data from disk or a remote source into memory and return a new array. The original is left unaltered.
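To illustrate the lazy-then-compute pattern those docs describe, here is a minimal dask.array sketch (random data, not the lesson's precipitation files):

```python
import dask.array as da

# Build a lazy computation: nothing is evaluated yet
x = da.random.random((1000, 100), chunks=(100, 100))
lazy_max = x.max(axis=0)   # still a task graph, not numbers

# .compute() triggers execution and returns a concrete numpy array;
# the original lazy object is left unaltered, as the docs note
result = lazy_max.compute()
```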
The link to the instructor notes (https://carpentrieslab.github.io/python-aos-lesson/guide/index.html) is dead. Thanks for fixing.
For the large data lesson, the instructor downloads and processes a 45GB dataset. It's impractical to require learners to download that dataset, so they don't get any hands on experience with actually processing large data.
A workaround might be to simply run an xarray command to duplicate the monthly dataset used in the earlier lessons (e.g. repeat the time axis 500 times) so it's much larger?
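As a sketch of that idea (hypothetical random data standing in for the lesson's monthly precipitation file), xr.concat can tile the time axis:

```python
import numpy as np
import pandas as pd
import xarray as xr

# Stand-in for the monthly dataset from the earlier lessons (5 years, monthly)
time = pd.date_range("2001-01-01", periods=60, freq="MS")
pr = xr.DataArray(np.random.rand(60, 4, 8), dims=["time", "lat", "lon"],
                  coords={"time": time}, name="pr")

# Tile the array along the time axis; the repeated timestamps don't matter
# if the goal is just a realistically large array for benchmarking
big = xr.concat([pr] * 500, dim="time")
```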
This is a common issue, particularly for data on a curvilinear ocean grid. Relevant notes can be found at the following issues:
There are general inconsistencies in how you refer to relative paths - sometimes data/..., other times .../data/.... While most users can figure this out, it would be better to go through the lessons and tidy this up.
It might be good to reference this publication somewhere in the materials for this lesson. If you point me to the right place, I can do the edits and make a PR.
The CONTRIBUTING.md file still has the default information in it and needs to be updated with the specifics relating to this repository.
The xr.open_dataset call in lesson 02-visualisation threw a TypeError for me. The error was:
TypeError: Error: .../data-carpentry/data/pr_Amon_ACCESS1-3_historical_r1i1p1_200101-200512.nc is not a valid NetCDF 3 file
If this is a NetCDF4 file, you may need to install the netcdf4 library, e.g.,
$ pip install netcdf4
This is likely due to the assumption that the conda env built by the users will have the python netcdf4 library in it... but that was not the case for my setup, so I have submitted an issue to add netcdf4 to the install process in lesson 01-conda.
I recommend adding netCDF4 to the install explicitly.
In the exercises of the large data lesson, learners are required to think about alternative options to Dask. The ARC Centre of Excellence for Climate Extremes have some notes that go through the options in detail. It would be good to have learners explore those notes so that they're aware of GNU parallel and the multiprocessing library.
For now I've put the new large data lesson in a new development directory (development/10-large-data.md) so it doesn't appear on the website until we've sorted a few issues.
By using OPeNDAP, when we time how long it takes to calculate the daily maximum we are conflating the time it takes to access the data over wifi with the time it takes to actually process the data. On my laptop with the data stored locally, it takes 2 minutes to calculate the daily maximum on 1 core and 8 minutes on 4 cores (see this notebook). (And for some reason I'm still getting NaNs...) Accessing the data via OPeNDAP can mean the calculation takes anywhere from 20 minutes to over an hour depending on internet traffic in my neighborhood (at least I think that's the reason for the variability in how long it takes).
One solution would be for the instructor to have the data stored locally on their laptop. We could add a note in the lesson to tell the instructor to download the required data from their closest ESGF node. We already expect that students will just watch the instructor live code an example rather than run the code themselves, so we can just restrict the exercises to questions that don't require downloading the data. This would mean that for a given instructor on a given computer, the processing time would be repeatable and we could hopefully construct a data processing task that is faster in parallel than in serial (and hopefully also a reasonably short processing time so that it could be done live).
The problem with that solution is that a data processing task that is faster in parallel than serial on one computer (and that runs quickly enough to do live) might not be on another. I guess we could just tell the instructor to show the lesson notes on the screen rather than running the commands themselves if they aren't able to download/store the data on their laptop or process it in an acceptable time live.
Thoughts, @hot007?
@fmichonneau As the last step of publishing these lessons with the Journal of Open Source Education, I need to generate a DOI using Zenodo:
https://guides.github.com/activities/citable-code/
I had Zenodo send an authorization request to carpentrieslab, which I'm assuming came to you?
At my 2021 Dask Summit presentation about teaching Dask to atmosphere and ocean scientists it was suggested that content could be added about the Dask task graph and debugging / best practices for finding pain points.
It was suggested that this PyData talk might be useful:
https://www.youtube.com/watch?v=JoK8V2eWFPE
I really like the Defensive Programming unit, but I'm of the opinion that you may want to reconsider teaching assert as a tool for validating user input. For one, explicit errors like ValueError are more descriptive of the problem than the generic AssertionError. The other problem is that if you run python as python -O or python -OO (equivalent to setting PYTHONOPTIMIZE to 1 or 2, respectively), all assert statements are ignored/compiled out. In general, assertions are supposed to be used to double-check conditions that should never occur, not for validation of user input. Some other posts around this:
Just my $0.02.
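To make the suggestion concrete, here is a hedged sketch of input validation via ValueError rather than assert (the function name echoes the lesson's unit-conversion task but is a hypothetical stand-in, not its actual code):

```python
def convert_pr_units(pr_value, units):
    """Convert precipitation from kg m-2 s-1 to mm/day.

    Raising ValueError gives the caller a descriptive error that cannot
    be stripped out; an `assert units == "kg m-2 s-1"` would vanish
    under `python -O` and otherwise fail with a bare AssertionError.
    """
    if units != "kg m-2 s-1":
        raise ValueError(f"Expected units of kg m-2 s-1, got {units!r}")
    return pr_value * 86400  # seconds per day
```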
On one hand I like the longer file names used in the lessons (e.g. pr_Amon_ACCESS-ESM1-5_historical_r1i1p1f1_gn_201001-201412-SON-clim.png) because they teach good practices around data reference syntax, but on the other hand it's a lot of typing for instructors when live coding...
As we're running this lesson next week, I'm doing a review at present and noticing a couple of small improvements. I'll update the list as I work through things and submit another PR
- print(dset['pr']) does not provide the output given for attributes (it gets truncated if using the python REPL), so a call to pprint could be added to overcome this consistently - applies to 02 and 08
- logging.critical is referred to as critcal (missing an i)
I'm going to use this issue to keep a list of other relevant lessons out there. At the moment I've got:
The content on data reference syntax from my old capstone lesson might be useful here?
GitHub has recently changed the interface at the time of repo creation to prompt creation of license, readme etc.
Also note GitHub seems to be pushing strongly toward use of SSH keys. I think if we can avoid it we should, to keep this conceptually simple (i.e. avoid keys), but I think it would be good for @DamienIrving to update this lesson with new screenshots to ensure it's accurate with the current GitHub interface, and maybe mention why SSH keys are good for security.
Showing a zoomed in portion of the global map can provide further examples of using cartopy features such as:
This can be accomplished by doing ax.add_feature(cartopy.feature.STATES) or ax.coastlines('10m').
This is an important aspect for oceanographers because we often analyze data in specific regions, unless the study is specific to global data, and showing these features allows for better interpretation of data location.
@DamienIrving I think the following will solve the problem of adding metadata to images:
import numpy as np
import matplotlib.pyplot as plt
from PIL import Image
from PIL import PngImagePlugin

f = "test.png"
METADATA = {"History": "command line argument"}

# Create a sample image
X = np.random.random((50, 50))
plt.imshow(X)
plt.savefig(f)

# Use PIL to save some image metadata
im = Image.open(f)
meta = PngImagePlugin.PngInfo()
for key in METADATA:
    meta.add_text(key, METADATA[key])
im.save(f, "png", pnginfo=meta)

# Re-open the file to confirm the metadata was written
im2 = Image.open(f)
print(im2.info)
In the GitHub section of the lesson, the author uses screenshots of a repository hosted at https://github.com/DamienIrving/data-carpentry to walk the student through the process of getting started on GitHub. Unfortunately, that repository doesn't actually exist - or at least, it's not publicly accessible! There's an opportunity here to improve the material in two small ways:
In lesson 04-cmdline you refer to a code directory that does not exist if one has followed the install/setup instructions provided. Best to either ask users to put the .py scripts into a code dir or to refactor the paths to not refer to code/....
e.g. in lesson 04-cmdline the line $ cat code/script_template.py would become $ cat script_template.py.
Using JupyterLab when teaching the lessons would make switching between notebooks, scripts and the command line much easier.
The limitation at the moment is that the terminal application in JupyterLab defaults to the system terminal (i.e. the unix shell on linux and mac but the powershell on windows). There are ways to change this behaviour on windows to access Git Bash instead, but it's a little bit fiddly.
jupyterlab/jupyterlab#7154
https://medium.com/@konpat/using-git-bash-in-jupyter-noteobok-on-windows-c88d2c3c7b07
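For reference, the approach in those links boils down to pointing Jupyter's terminal at the Git Bash executable via a config fragment (the path below is the default Git for Windows install location and may differ on a given machine):

```python
# In ~/.jupyter/jupyter_notebook_config.py (Windows, Git for Windows installed)
c.NotebookApp.terminado_settings = {
    "shell_command": ["C:\\Program Files\\Git\\bin\\bash.exe", "--login", "-i"]
}
```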
The short segue discussing pdb and debugging in Lesson 4 seems a little extraneous. It's shown as an aside or a segue, but doesn't really reinforce the core point of the module (building and running more universal command line programs from your test scripts). If it's core material that the author wishes the student to dwell on a bit, then there may be two ways to better integrate it into the material:
For instance, the GitHub material comes prior to the Defensive Programming material, which makes a lot of sense. One tool that the author could use to emphasize that debugging exists and is useful could be to accidentally commit a line with a syntax error in it. The author could then walk through fixing the issue in a simple way that introduces some more tools: First, the author could show pdb to help track down the bug. Second, they could use git blame to figure out when that line was changed. Finally, they could use git commit --amend or a revert to undo and fix the issue. The point of this short exercise would be to walk through a real-life, very common use case which the student will certainly encounter, and emphasize how these tools and techniques help minimize the pain that debugging causes with ad hoc development practices.
It would be nice to add a lesson on wrangling netCDF files.
The premise for the lesson could be that we want to download some ERA-Interim precipitation data (total precipitation, synoptic monthly mean) to compare to our models. Wrangling that data would involve something like the following (from here) to have the ERA-Interim data dimensions and attributes match the CMIP5 data files:
ncpdq -P upk ${infile} ${infile}
cdo invertlat -sellonlatbox,0,359.9,-90,90 -mulc,33 -monsum ${infile} ${outfile}
ncrename -O -v tp,pr ${outfile}
ncatted -O -a calendar,global,d,, ${outfile}
ncatted -O -a standard_name,pr,o,c,"precipitation_flux" ${outfile}
ncatted -O -a long_name,pr,o,c,"precipitation flux" ${outfile}
ncatted -O -a units,pr,o,c,"mm/day" ${outfile}
(The -mulc,33 covers multiplying by 1000 to convert the units from m/month to mm/month and then dividing by 30 to crudely convert to mm/day, since 1000/30 ≈ 33.)
The main barrier to this lesson is that cdo isn't available on windows and the conda-forge recipe for Mac is broken: conda-forge/cdo-feedstock#15
I love the convenience of cdo, but the fact that it essentially only works on Linux machines is very problematic. To get around this, I could do the cdo parts of the wrangling on a Linux machine in advance and provide the final file for download - the participants could then do the nco parts?
Other useful links:
The paper for submission to JOSE is missing an explicit short description of what, exactly, the learning materials are and the sequence in which they would be introduced to a student. This is partially covered in the "Summary" section, but could really be broken down into its own section, which should include:
CC: carpentrieslab/python-aos-lesson/issues/17
The large data lesson mentions that map_blocks and apply_ufunc can be used to write dask-aware functions, but doesn't actually do it. It might be useful to add a simple example.
A good resource is this example from NCAR-ESDS:
https://ncar.github.io/esds/posts/map_blocks_example/
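Something along these lines might work as the simple example, using eager data so it runs anywhere; with a chunked array, the same apply_ufunc call can be made lazy (hypothetical random data stands in for the lesson's precipitation files):

```python
import numpy as np
import xarray as xr

pr = xr.DataArray(np.random.rand(12, 4, 8),
                  dims=["time", "lat", "lon"], name="pr")

def to_mm_per_day(arr):
    # Plain numpy function: apply_ufunc feeds it the underlying array(s)
    return arr * 86400

# Eager here; add dask="parallelized" and output_dtypes=[float]
# when pr is dask-backed to get a lazy, chunk-wise computation
out = xr.apply_ufunc(to_mm_per_day, pr)
```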
People dealing with ocean data (due to the extra depth dimension) or high time frequency data (e.g. hourly data) tend to run into issues (like memory errors) due to the large size of their data arrays.
Some lesson content on Dask would be helpful here.
At the end of every lesson, use a solutions hidden drop down box to write out the full plot_precipitation_climatology.py script, as it should look after the challenges are complete. That way people who get behind can cut and paste to catch up.
Would you be interested in help migrating this lesson to the new lesson infrastructure, The Carpentries Workbench? We recently published documentation for a semi-automated transition workflow, and I am also happy to work on this if you would like help?
The helper_lesson_check.md instructions reference data/pr_Amon_ACCESS1-3_historical_r1i1p1_200101-200512.nc in step (5), but this is neither the name of a file downloaded as part of the setup process nor contained within the data folder of the repository.
This post discusses the issues: https://drclimate.wordpress.com/2015/04/06/workflow-automation/
An abbreviated version of the Software Carpentry lesson on Automation and Make might work?
I am working my way through the lessons in preparation for teaching them in the fall. I hit a snag right off when I moved to episode 1 from the entry page without doing the set-up first. The information for setting up the files and Conda is described (very well) in the Set-up tab. However, it was easy to miss.
The front page of the Python for Atmosphere and Ocean Scientists has blue boxes marked Prerequisites and Citation. I would add another blue box called "Getting Started", modeled on the entry page of the Ecology lesson. To emphasize it, I would move the "raster vs vector data" highlight and the citation to below the Schedule.
This is an easy change. However, this is a new lesson and I'm submitting this PR in partial fulfillment of the Carpentries instructor training requirements (so I don't know quite what I'm doing). In addition, the lesson's CONTRIBUTING.md is unclear because it includes defaults like "https://github.com/swcarpentry/FIXME". So I'd like some guidance from the lesson maintainers about how they would like me to make these changes.
BTW. As I work my way through the lessons, I hope to submit many issues and PRs as I go along.
It might be a good idea to have Pangeo Binder available as a backup for those people who have problems getting everything working on their own machines. The Binder instances have 8GB of RAM and you basically make a GitHub repo available and specify an environment.yml file and then the users work from Jupyter Lab.
You can make a fancy button for people to click on in your README. e.g:
[![Binder](https://mybinder.org/badge_logo.svg)](https://binder.pangeo.io/v2/gh/ARM-Development/PyART-Training/HEAD?urlpath=lab)
Submission to JOSE is indicated as v1.0.0, but the most recent release tag is v9.3.1. I think tweaks could be made in either place (here or in the JOSE submission); if a new v1.0.0 release is cut here, it would be helpful to add a note to the README.md indicating how the sequence of version numbers diverges.
At the moment the vectorisation lesson is very short and simply introduces the idea of not looping over arrays in Python.
It would be good to extend that lesson to talk more broadly about how to think about array operations using xarray. The following tutorial from @dcherian talks about working in index versus label space, and how to use methods like reduce and map to apply functions to arrays: https://github.com/ProjectPythiaTutorials/thinking-with-xarray_2022_03_09
In terms of the data analysis example in these PyAOS lessons, the concepts introduced in an extended xarray thinking lesson could be used to plot the seasonal climatology (i.e. four panels in one plot), custom seasons (e.g. 'NDJFMA', 'MJJASO'), apply spatial smoothing, etc.
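For instance, the seasonal climatology part could be as simple as a groupby over time.season (random data here as a stand-in for the lesson's precipitation files):

```python
import numpy as np
import pandas as pd
import xarray as xr

time = pd.date_range("2010-01-01", periods=24, freq="MS")
pr = xr.DataArray(np.random.rand(24, 4, 8), dims=["time", "lat", "lon"],
                  coords={"time": time}, name="pr")

# Mean over all Decembers/Januaries/Februaries, etc: one map per season,
# ready to plot as four panels
clim = pr.groupby("time.season").mean("time")
```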
In preparation for this PyAOS workshop, participants are required to follow the Software Carpentry software installation instructions to install Anaconda. Windows users are also required to install a terminal emulator called Git BASH.
For the generic Software Carpentry Python lessons, this all works fine. Once Anaconda and Git BASH are installed, windows users can type "python" at the Git BASH command prompt to start a python command line session.
An unexpected problem I had upon teaching the PyAOS lesson on software installation using conda is that windows users can't type "conda" at the Git BASH command prompt (even if they follow our instructions correctly and check the box to make the Anaconda python the default python).
While this was annoying, it would be possible to get around this issue by re-writing the lesson so that packages are installed using the Anaconda Navigator GUI rather than at the command line.
The problem that I don't have an obvious solution for is that subsequent lessons require the participant to activate their conda environment (using source activate at the command line) and then execute a Python script.
$ source activate pyaos-lesson
(pyaos-lesson) $ python plot_precipitation_climatology.py -h
My recollection from the workshop is that, like the command conda, source activate is also not available at the Git BASH command prompt.
I could remove discussion of conda environments from the lessons altogether, but the ability to export and share conda environments is a real game changer for reproducible research, so I'm reluctant to do that.
Does anyone have any suggested solutions to this problem? (I don't have a windows machine to play around with the configuration of Git BASH.)