carpentries-lab / python-aos-lesson
Python for Atmosphere and Ocean Scientists
Home Page: https://carpentries-lab.github.io/python-aos-lesson/
License: Other
AUTHORS should include @DamienIrving to synchronize with JOSE submission.
In lesson 07-vectorisation, the first call to xarray.open_dataset raises an error because the xarray module is imported as xr. Change xarray.open_dataset to xr.open_dataset?
A number of improvements could be made by switching from a data analysis problem focused on precipitation data to one focused on an ocean variable: the cmocean colormap library could be used, and the vectorisation lesson could use a basin file rather than an sftlf file.
When working on the second episode, when I get to changing the units for the precip data, I get the following error:
AttributeError: cannot set attribute 'units' on a 'DataArray' object. Use setitem style assignment (e.g., ds['name'] = ...) instead to assign variables.
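As the error message hints, recent xarray versions disallow attribute-style assignment of metadata; going through the .attrs dictionary avoids it. A minimal sketch (the variable names and values are illustrative, not the lesson's exact code):

```python
import numpy as np
import xarray as xr

# Stand-in for the lesson's precipitation DataArray (hypothetical values)
pr = xr.DataArray(np.random.rand(3, 4), dims=["lat", "lon"], name="pr")
pr.attrs["units"] = "kg m-2 s-1"

# Convert kg m-2 s-1 to mm/day, then update the units metadata.
# Assigning via .attrs avoids the AttributeError raised by `pr.units = ...`
pr_mmday = pr * 86400
pr_mmday.attrs["units"] = "mm/day"
```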
The Carpentries Lesson Template includes a section on Checking and Previewing. There are scripts for that which are not included in the python-aos-lesson.
I would like to try using those scripts to render locally before I push changes to my repo. But it might not be needed if most development and editing is done directly in GitHub. If I use the scripts, I will PR a branch with them.
Hi @DamienIrving,
Towards the end of the large data episode, https://carpentries-lab.github.io/python-aos-lesson/10-large-data/index.html, pr_max.compute() is invoked to compute the max precipitation values across all times.
This pr_max.compute() call should:
--or--
From the xarray doc: https://xarray.pydata.org/en/stable/generated/xarray.DataArray.compute.html
Manually trigger loading of this array’s data from disk or a remote source into memory and return a new array. The original is left unaltered.
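To illustrate the lazy-then-compute pattern those docs describe, here is a minimal dask.array sketch (random data, not the lesson's precipitation files):

```python
import dask.array as da

# Build a lazy computation: nothing is evaluated yet
x = da.random.random((1000, 100), chunks=(100, 100))
lazy_max = x.max(axis=0)   # still a task graph, not numbers

# .compute() triggers execution and returns a concrete numpy array;
# the original lazy object is left unaltered, as the docs note
result = lazy_max.compute()
```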
The link to the instructor notes (https://carpentrieslab.github.io/python-aos-lesson/guide/index.html) is dead. Thanks for fixing.
For the large data lesson, the instructor downloads and processes a 45GB dataset. It's impractical to require learners to download that dataset, so they don't get any hands on experience with actually processing large data.
A workaround might be to simply run an xarray command to duplicate the monthly dataset used in the earlier lessons (e.g. repeat the time axis 500 times) so it's much larger?
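As a sketch of that idea (hypothetical random data standing in for the lesson's monthly precipitation file), xr.concat can tile the time axis:

```python
import numpy as np
import pandas as pd
import xarray as xr

# Stand-in for the monthly dataset from the earlier lessons (5 years, monthly)
time = pd.date_range("2001-01-01", periods=60, freq="MS")
pr = xr.DataArray(np.random.rand(60, 4, 8), dims=["time", "lat", "lon"],
                  coords={"time": time}, name="pr")

# Tile the array along the time axis; the repeated timestamps don't matter
# if the goal is just a realistically large array for benchmarking
big = xr.concat([pr] * 500, dim="time")
```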
This is a common issue, particularly for data on a curvilinear ocean grid. Relevant notes can be found at the following issues:
There are general inconsistencies in how you refer to relative paths - sometimes data/..., other times .../data/.... While most users can figure this out, it would be better to go through the lessons and tidy this up.
It might be good to reference this publication somewhere in the materials for this lesson. If you point me to the right place, I can do the edits and make a PR.
The CONTRIBUTING.md file still has the default information in it and needs to be updated with the specifics relating to this repository.
The xr.open_dataset call in lesson 02-visualisation threw a TypeError for me. The error was:
TypeError: Error: .../data-carpentry/data/pr_Amon_ACCESS1-3_historical_r1i1p1_200101-200512.nc is not a valid NetCDF 3 file
If this is a NetCDF4 file, you may need to install the netcdf4 library, e.g.,
$ pip install netcdf4
This is likely due to the assumption that the conda env built by the users will have the python netcdf4 library in it... but that was not the case for my setup, so I have submitted an issue to add netcdf4 to the install process in lesson 01-conda.
I recommend adding netCDF4 to the install explicitly.
In the exercises of the large data lesson, learners are required to think about alternative options to Dask. The ARC Centre of Excellence for Climate Extremes have some notes that go through the options in detail. It would be good to have learners explore those notes so that they're aware of GNU parallel and the multiprocessing library.
For now I've put the new large data lesson in a new development directory (development/10-large-data.md) so it doesn't appear on the website until we've sorted a few issues.
By using OPeNDAP, when we time how long it takes to calculate the daily maximum we are conflating the time it takes to access the data over wifi with the time it takes to actually process the data. On my laptop with the data stored locally, it takes 2 minutes to calculate the daily maximum on 1 core and 8 minutes on 4 cores (see this notebook). (And for some reason I'm still getting NaNs...) Accessing the data via OPeNDAP can mean the calculation takes anywhere from 20 minutes to over an hour depending on internet traffic in my neighborhood (at least I think that's the reason for the variability in how long it takes).
One solution would be for the instructor to have the data stored locally on their laptop. We could add a note in the lesson to tell the instructor to download the required data from their closest ESGF node. We already expect that students will just watch the instructor live code an example rather than run the code themselves, so we can just restrict the exercises to questions that don't require downloading the data. This would mean that for a given instructor on a given computer, the processing time would be repeatable and we could hopefully construct a data processing task that is faster in parallel than in serial (and hopefully also a reasonably short processing time so that it could be done live).
The problem with that solution is that a data processing task that is faster in parallel than serial on one computer (and that runs quickly enough to do live) might not be on another. I guess we could just tell the instructor to show the lesson notes on the screen rather than running the commands themselves if they aren't able to download/store the data on their laptop or process it in an acceptable time live.
Thoughts, @hot007?
@fmichonneau As the last step of publishing these lessons with the Journal of Open Source Education, I need to generate a DOI using Zenodo:
https://guides.github.com/activities/citable-code/
I had Zenodo send an authorization request to carpentrieslab, which I'm assuming came to you?
At my 2021 Dask Summit presentation about teaching Dask to atmosphere and ocean scientists it was suggested that content could be added about the Dask task graph and debugging / best practices for finding pain points.
It was suggested that this PyData talk might be useful:
https://www.youtube.com/watch?v=JoK8V2eWFPE
I really like the Defensive Programming unit, but I'm of the opinion that you may want to reconsider teaching assert as a tool for validating user input. For one, explicit errors like ValueError are more descriptive of the problem than the generic AssertionError. The other problem is that if you run python as python -O or python -OO (equivalent to setting PYTHONOPTIMIZE to 1 or 2, respectively), all assert statements are ignored/compiled out. In general, assertions are supposed to be used to double-check conditions that should never occur, not for validation of user input. Some other posts around this:
Just my $0.02.
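To make the suggestion concrete, here is a hedged sketch of input validation via ValueError rather than assert (the function name echoes the lesson's unit-conversion task but is a hypothetical stand-in, not its actual code):

```python
def convert_pr_units(pr_value, units):
    """Convert precipitation from kg m-2 s-1 to mm/day.

    Raising ValueError gives the caller a descriptive error that cannot
    be stripped out; an `assert units == "kg m-2 s-1"` would vanish
    under `python -O` and otherwise fail with a bare AssertionError.
    """
    if units != "kg m-2 s-1":
        raise ValueError(f"Expected units of kg m-2 s-1, got {units!r}")
    return pr_value * 86400  # seconds per day
```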
On one hand I like the longer file names used in the lessons (e.g. pr_Amon_ACCESS-ESM1-5_historical_r1i1p1f1_gn_201001-201412-SON-clim.png) because they teach good practices around data reference syntax, but on the other hand it's a lot of typing for instructors when live coding...
As we're running this lesson next week, I'm doing a review at present and noticing a couple of small improvements. I'll update the list as I work through things and submit another PR
- print(dset['pr']) does not provide the output given for attributes (it gets truncated if using the python REPL), so a call to pprint could be added to overcome this consistently - applies to 02 and 08
- logging.critical is referred to as critcal (missing an i)
I'm going to use this issue to keep a list of other relevant lessons out there. At the moment I've got:
The content on data reference syntax from my old capstone lesson might be useful here?
GitHub has recently changed the interface at the time of repo creation to prompt creation of license, readme etc.
Also note GitHub seems to be pushing strongly toward use of SSH keys. I think if we can avoid it we should, to keep this conceptually simple (i.e. avoid keys), but I think it would be good for @DamienIrving to update this lesson with new screenshots to ensure it's accurate with the current GitHub interface, and maybe mention why SSH keys are good for security.
Showing a zoomed in portion of the global map can provide further examples of using cartopy features such as:
This can be accomplished by doing ax.add_feature(cartopy.feature.STATES) or ax.coastlines('10m').
This is an important aspect for oceanographers because we often analyze data in specific regions, unless the study is specific to global data, and showing these features allows for better interpretation of data location.
@DamienIrving I think the following will solve the problem of adding metadata to images:
import numpy as np
import matplotlib.pyplot as plt
from PIL import Image
from PIL import PngImagePlugin

f = "test.png"
METADATA = {"History": "command line argument"}

# Create a sample image
X = np.random.random((50, 50))
plt.imshow(X)
plt.savefig(f)

# Use PIL to save some image metadata
im = Image.open(f)
meta = PngImagePlugin.PngInfo()
for key in METADATA:
    meta.add_text(key, METADATA[key])
im.save(f, "png", pnginfo=meta)

# Re-open the file to confirm the metadata was written
im2 = Image.open(f)
print(im2.info)
In the GitHub section of the lesson, the author uses screenshots of a repository hosted at https://github.com/DamienIrving/data-carpentry to walk the student through the process of getting started on GitHub. Unfortunately, that repository doesn't actually exist - or at least, it's not publicly accessible! There's an opportunity here to improve the material in two small ways:
In lesson 04-cmdline you refer to a code directory that does not exist if one has followed the install/setup instructions provided. Best to either ask users to put the .py scripts into a code dir or to refactor the paths to not refer to code/....
e.g. in lesson 04-cmdline the line $ cat code/script_template.py would become $ cat script_template.py.
Using JupyterLab when teaching the lessons would make switching between notebooks, scripts and the command line much easier.
The limitation at the moment is that the terminal application in JupyterLab defaults to the system terminal (i.e. the unix shell on linux and mac but the powershell on windows). There are ways to change this behaviour on windows to access Git Bash instead, but it's a little bit fiddly.
jupyterlab/jupyterlab#7154
https://medium.com/@konpat/using-git-bash-in-jupyter-noteobok-on-windows-c88d2c3c7b07
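For reference, the approach in those links boils down to pointing Jupyter's terminal at the Git Bash executable via a config fragment (the path below is the default Git for Windows install location and may differ on a given machine):

```python
# In ~/.jupyter/jupyter_notebook_config.py (Windows, Git for Windows installed)
c.NotebookApp.terminado_settings = {
    "shell_command": ["C:\\Program Files\\Git\\bin\\bash.exe", "--login", "-i"]
}
```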
The short segue discussing pdb and debugging in Lesson 4 seems a little extraneous. It's shown as an aside or a segue, but doesn't really reinforce the core point of the module (building and running more universal command line programs from your test scripts). If it's core material that the author wishes the student to dwell on a bit, then there may be two ways to better integrate it into the material:
For instance, the GitHub material comes prior to the Defensive Programming material, which makes a lot of sense. One tool that the author could use to emphasize that debugging exists and is useful could be to accidentally commit a line with a syntax error in it. The author could then walk through fixing the issue in a simple way that introduces some more tools: First, the author could show pdb to help track down the bug. Second, they could use git blame to figure out when that line was changed. Finally, they could use git commit --amend or a revert to undo and fix the issue. The point of this short exercise would be to walk through a real-life, very common use case which the student will certainly encounter, and emphasize how these tools and techniques help minimize the pain that debugging causes with ad hoc development practices.
It would be nice to add a lesson on wrangling netCDF files.
The premise for the lesson could be that we want to download some ERA-Interim precipitation data (total precipitation, synoptic monthly mean) to compare to our models. Wrangling that data would involve something like the following (from here) to have the ERA-Interim data dimensions and attributes match the CMIP5 data files:
ncpdq -P upk ${infile} ${infile}
cdo invertlat -sellonlatbox,0,359.9,-90,90 -mulc,33 -monsum ${infile} ${outfile}
ncrename -O -v tp,pr ${outfile}
ncatted -O -a calendar,global,d,, ${outfile}
ncatted -O -a standard_name,pr,o,c,"precipitation_flux" ${outfile}
ncatted -O -a long_name,pr,o,c,"precipitation flux" ${outfile}
ncatted -O -a units,pr,o,c,"mm/day" ${outfile}
(The -mulc,33 covers multiplying by 1000 to convert the units from m/month to mm/month and then dividing by 30 to crudely convert to mm/day, since 1000/30 ≈ 33.)
The main barrier to this lesson is that cdo isn't available on windows and the conda-forge recipe for Mac is broken: conda-forge/cdo-feedstock#15
I love the convenience of cdo, but the fact that it essentially only works on Linux machines is very problematic. To get around this, I could do the cdo parts of the wrangling on a Linux machine in advance and provide the final file for download - the participants could then do the nco parts?
Other useful links:
The paper for submission to JOSE is missing an explicit short description of what, exactly, the learning materials are and the sequence in which they would be introduced to a student. This is partially covered in the "Summary" section, but could really be broken down into its own section, which should include:
CC: carpentrieslab/python-aos-lesson/issues/17
The large data lesson mentions that map_blocks and apply_ufunc can be used to write dask-aware functions, but doesn't actually do it. It might be useful to add a simple example.
A good resource is this example from NCAR-ESDS:
https://ncar.github.io/esds/posts/map_blocks_example/
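Something along these lines might work as the simple example, using eager data so it runs anywhere; with a chunked array, the same apply_ufunc call can be made lazy (hypothetical random data stands in for the lesson's precipitation files):

```python
import numpy as np
import xarray as xr

pr = xr.DataArray(np.random.rand(12, 4, 8),
                  dims=["time", "lat", "lon"], name="pr")

def to_mm_per_day(arr):
    # Plain numpy function: apply_ufunc feeds it the underlying array(s)
    return arr * 86400

# Eager here; add dask="parallelized" and output_dtypes=[float]
# when pr is dask-backed to get a lazy, chunk-wise computation
out = xr.apply_ufunc(to_mm_per_day, pr)
```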
People dealing with ocean data (due to the extra depth dimension) or high time frequency data (e.g. hourly data) tend to run into issues (like memory errors) due to the large size of their data arrays.
Some lesson content on Dask would be helpful here.
At the end of every lesson, use a solutions hidden drop down box to write out the full plot_precipitation_climatology.py script, as it should look after the challenges are complete. That way people who get behind can cut and paste to catch up.
Would you be interested in help migrating this lesson to the new lesson infrastructure, The Carpentries Workbench? We recently published documentation for a semi-automated transition workflow, and I am also happy to work on this if you would like help?
The helper_lesson_check.md instructions reference data/pr_Amon_ACCESS1-3_historical_r1i1p1_200101-200512.nc in step (5), but this is neither the name of a file downloaded as part of the setup process nor contained within the data folder of the repository.
This post discusses the issues: https://drclimate.wordpress.com/2015/04/06/workflow-automation/
An abbreviated version of the Software Carpentry lesson on Automation and Make might work?
I am working my way through the lessons in preparation for teaching them in the fall. I hit a snag right off when I moved to episode 1 from the entry page without doing the set-up first. The information for setting up the files and Conda is described (very well) in the Set-up tab. However, it was easy to miss.
The front page of the Python for Atmosphere and Ocean Scientists has blue boxes marked Prerequisites and Citation. I would add another blue box called "Getting Started", modeled on the entry page of the Ecology lesson. To emphasize it, I would move the "raster vs vector data" highlight and the citation to below the Schedule.
This is an easy change. However, this is a new lesson and I'm submitting this PR in partial fulfillment of the Carpentries instructor training requirements (so I don't know quite what I'm doing). In addition, the lesson's CONTRIBUTING.md is unclear because it includes defaults like "https://github.com/swcarpentry/FIXME". So I'd like some guidance from the lesson maintainers about how they would like me to make these changes.
BTW. As I work my way through the lessons, I hope to submit many issues and PRs as I go along.
It might be a good idea to have Pangeo Binder available as a backup for those people who have problems getting everything working on their own machines. The Binder instances have 8GB of RAM and you basically make a GitHub repo available and specify an environment.yml file and then the users work from Jupyter Lab.
You can make a fancy button for people to click on in your README. e.g:
[![Binder](https://mybinder.org/badge_logo.svg)](https://binder.pangeo.io/v2/gh/ARM-Development/PyART-Training/HEAD?urlpath=lab)
Submission to JOSE is indicated as v1.0.0, but the most recent release tag is v9.3.1. I think tweaks could be made in either place (here or in the JOSE submission); if a new v1.0.0 release is cut here, it would be helpful to add a note to the README.md indicating how the sequence of version numbers diverges.
At the moment the vectorisation lesson is very short and simply introduces the idea of not looping over arrays in Python.
It would be good to extend that lesson to talk more broadly about how to think about array operations using xarray. The following tutorial from @dcherian talks about working in index versus label space, and how to use methods like reduce and map to apply functions to arrays: https://github.com/ProjectPythiaTutorials/thinking-with-xarray_2022_03_09
In terms of the data analysis example in these PyAOS lessons, the concepts introduced in an extended xarray thinking lesson could be used to plot the seasonal climatology (i.e. four panels in one plot), custom seasons (e.g. 'NDJFMA', 'MJJASO'), apply spatial smoothing, etc.
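For instance, the seasonal climatology part could be as simple as a groupby over time.season (random data here as a stand-in for the lesson's precipitation files):

```python
import numpy as np
import pandas as pd
import xarray as xr

time = pd.date_range("2010-01-01", periods=24, freq="MS")
pr = xr.DataArray(np.random.rand(24, 4, 8), dims=["time", "lat", "lon"],
                  coords={"time": time}, name="pr")

# Mean over all Decembers/Januaries/Februaries, etc: one map per season,
# ready to plot as four panels
clim = pr.groupby("time.season").mean("time")
```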
In preparation for this PyAOS workshop, participants are required to follow the Software Carpentry software installation instructions to install Anaconda. Windows users are also required to install a terminal emulator called Git BASH.
For the generic Software Carpentry Python lessons, this all works fine. Once Anaconda and Git BASH are installed, windows users can type "python" at the Git BASH command prompt to start a python command line session.
An unexpected problem I had upon teaching the PyAOS lesson on software installation using conda is that windows users can't type "conda" at the Git BASH command prompt (even if they follow our instructions correctly and check the box to make the Anaconda python the default python).
While this was annoying, it would be possible to get around this issue by re-writing the lesson so that packages are installed using the Anaconda Navigator GUI rather than at the command line.
The problem that I don't have an obvious solution for is that subsequent lessons require the participant to activate their conda environment (using source activate at the command line) and then execute a Python script.
$ source activate pyaos-lesson
(pyaos-lesson) $ python plot_precipitation_climatology.py -h
My recollection from the workshop is that, like the command conda, source activate is also not available at the Git BASH command prompt.
I could remove discussion of conda environments from the lessons altogether, but the ability to export and share conda environments is a real game changer for reproducible research, so I'm reluctant to do that.
Does anyone have any suggested solutions to this problem? (I don't have a windows machine to play around with the configuration of Git BASH.)