terraref / reference-data Goto Github PK

View Code? Open in Web Editor NEW

9.0 9.0 2.0 369 KB

Coordination of Data Products and Standards for TERRA reference data

Home Page: https://terraref.org

License: BSD 3-Clause "New" or "Revised" License

Shell 12.46% R 87.54%

agriculture phenomics sensor-data

reference-data's People

Contributors

Stargazers

Watchers

Forkers

gitter-badger hsiyjnd

reference-data's Issues

Fix invalid environmental_logger files

Need to re-process existing environmental logger files per #26

rename to fix typo in name
convert to valid json
move corrected files back to /projects/arpae/terraref/raw_data/ua-mac/EnvironmentLogger
delete invalid, misnamed files

something like:

for file in /projects/arpae/terraref/raw_data/ua-mac/EnvironmentLogger/*/*_enviromentlogger.json; 
do 
  cp $file $file.backup
  echo "[$(cat $file)]"  > $file
  sed -i 's/}{/},{/g' $file
  rename enviroment environment $file 
done

It may not be the most efficient, but it seems to work.

When everything checks out, we can rm *.backup

Create custom spatial reference systems for PostGIS

@yanliu-chn talked today about creating 3 custom SRS's for this project that I didn't capture very well in my notes:

Arizona MAC coordinate system
USDA coordinate system
(0,0) inversion

Could you say a tiny bit about these? I wanted to create an issue to capture that. For my own understanding, I found this:
http://geeohspatial.blogspot.com/2013/03/custom-srss-in-postgis-and-qgis.html
http://daniel-azuma.com/articles/georails/part-9

...where ultimately we might simply need to define our custom SRID in PostGIS with something like:

INSERT into spatial_ref_sys (srid, auth_name, auth_srid, proj4text, srtext) values ( 96703, 'sr-org', 6703, '+proj=aea +lat_1=29.5 +lat_2=45.5 +lat_0=23 +lon_0=-96 +x_0=0 +y_0=0 +ellps=GRS80 +datum=NAD83 +units=m +no_defs', 'PROJCS["USA_Contiguous_Albers_Equal_Area_Conic_USGS_version",GEOGCS["GCS_North_American_1983",DATUM["D_North_American_1983",SPHEROID["GRS_1980",6378137.0,298.257222101]],PRIMEM["Greenwich",0.0],UNIT["Degree",0.0174532925199433]],PROJECTION["Albers"],PARAMETER["False_Easting",0.0],PARAMETER["False_Northing",0.0],PARAMETER["Central_Meridian",-96.0],PARAMETER["Standard_Parallel_1",29.5],PARAMETER["Standard_Parallel_2",45.5],PARAMETER["Latitude_Of_Origin",23.0],UNIT["Meter",1.0]]');

(note that the authority name and ID are not required if the system is not specified by an outside authority like EPSG)

@robkooper
We could then define columns in Clowder PostGIS for each of the custom SRIDs - a setup script could check for some custom definitions file in Clowder, insert those into PostGIS if found, and change how the datapoints/etc. tables are defined accordingly. The geog column (which is currently the only one) could still be SRID 4326 and that can always be present, but then we could also have geog_96703 or other SRIDs for project-specific SRS's.

So for Terra we might end up with geog, geog_999001, geog_999002, geog_999003 where the 99900x are our 3 SRIDs. Then we require data to be submitted in one of the 4 reference systems. If a new sensor emerges with a new one, we add a column for it.

This is mostly thinking out loud... maybe I'm way off.

Define calibration protocols for each sensor

An initial draft of calibration protocols: https://docs.google.com/document/d/132_dkGAIQJ3cG7bQkPIkX7-RgXyWLDoQWJFDj5c-5uU/edit#

For each sensor, define:

sensor name
test object(s) / calibration panels
method of calibration
required experiments to validate the quality of calibration protocol

Standards Committee Meeting - Arizona

Proposed formats and databases for genomics data

(by Mike Gore and Elodie Gazave)

Genomic data have reached a high level of standardization in the scientific community. Below are the most widely accepted formats that are relevant to the data and analyses that will be generated in this project.

Today, all high-impact journals typically ask the author to deposit their genomic data in either or both of these databases before publication.

Overview of Genomics Pipeline

Details

Raw reads + quality scores

Raw reads + quality scores are stored in FASTQ format. FASTQ files can be manipulated for QC with FASTX-Toolkit

Reference genome assembly

Reference genome assembly (for alignment of reads or BLAST) is in FASTA format. FASTA files generally need indexing and formatting that can be done by aligners, BLAST, or other applications that provide built-in commands for this purpose.

Sequence alignment

Sequence alignments are in BAM format – in addition to the nucleotide sequence, the BAM format contains fields to describe mapping and read quality. BAM files are binary files but can be visualized with IGV. If needed, BAM can be converted in SAM (text file) with SAMtools

BAM is the preferred format for sra database (sequence read archive).

de novo sequence alignment

Not part of milestone, though some groups may wish to do this. Will implement if / when needed. Reference sequences will be 17x, (perhaps not enough for de novo alignment).

SNP and genotype variants

SNP and genotype variants are in VCF format. VCF contains all information about read mapping and SNP and genotype calling quality. VCF files are typically manipulated with vcftools

VCF format is also the format required by dbSNP, the largest public repository all SNPs.

Genomic coordinates

Genomic coordinates are given in a BED format – gives the start and end positions of a feature in the genome (for single nucleotides, start = end). http://www.ensembl.org/info/website/upload/bed.html BED files can be manipulated with bedtools

Genome annotations

Genome annotations are in GFF format GFF format contains genes and other genomic features. Allows “track” info for visualization http://useast.ensembl.org/info/website/upload/gff.html

Visualizing and annotating Genomes

Gbrowse is a comprehensive database + interactive web application for manipulating and displaying annotations on genomes.

Downstream

Analysis tools that all use SNP data (vcf) as input.

TASSEL-GBS
GAPIT
MLMM
R/qtl

Repository

Submit variant calls to Phytozome, has embedded Jbrowse

Begin recording site-level meta-data for TERRA experimental sites

~~record information that fits within the ICASA standard (e.g. #6) in a spreadsheet or json blob~~
Data beyond scope of ICASA standards in something like a 'native' format - that can be automatically generated, synced when updated, and parsed
- site layout including soils (in geojson (?))
- managements in csv file (site, management type, level, date)
  - fertilization, tillage, irrigation, etc.
- subdaily meteorological data (datetime, site, variable1, variable2, ...)
- cultivars
- experimental design (location of plants, plots)
- precision planter output
~~Insert into BETYdb~~

historical weather data

It would be great to have daily historical weather data (to include t_min, t_max, and precipitation) for Maricopa (or a nearby location) available somewhere. I will use this to build a weather model to base the regression on.

Vocabularies / ontologies to support, and framework for linking synonyms

Vocabularies to support and framework for relating synonyms across vocabularies, enable to search datasets with translation across vocabularies

Vocabularies to support:

CGIAR sorghum ontology:
- http://www.cropontology.org/ontology/CO_324/Sorghum%20ontology
- http://www.cropontology.org/ontology/CO_715/Crop%20Research
CF Standard name XML table http://cfconventions.org/Data/cf-standard-names/33/src/cf-standard-name-table.xml
BETYdb variables https://www.betydb.org/variables.json
ICASA Master Variable list

Related issues:

Add geospatial metadata to hyperspectral metadata

Currently the files only contain the start position. We don't have the width of the line and we don't have the location of each line.

Presumably the gantry knows the x location where each line is triggered.

We need the bounding box for the field of view
We also need to know the x position of each line

Additional questions:

is the grid regular with x and y orthogonal?
- e.g. does x change within a line / does x depend on y and vice-versa?
or could we start with bounding box and later reproject to rectangular grid

@solmazhajmohammadi and @FlyingWithJerome and @czender please update / clarify / revise this issue
@markus-radermacher-lemnatec suggestions?

Protocol to define access, re-distribution, attribution of documentation, system specifications, sample data

The standards committee requested information about the specification of sensors and their location on the gantry. This information is spread across multiple documents. Some could be shared in their entirety, others may need to be modified to have proprietary information redacted. Currently, we do not have a way of identifying the intended audience (e.g. some information that can be shared publicly is in the same folder or document as proprietary information).

We should create a simple protocol for identifying the intended audience and any restrictions on sharing and reuse.

We should first address the clear identification of any restrictions on sharing (Part 1). The mechanisms for controlling access (Part 2) are built into tools we use (Box, Google Drive, computer file systems).

Identifying the intended audience could be simplified by creating folders named "private", "internal", "shared", etc. But it would also be useful to clearly identify the intended audience in a file named README, COPYRIGHT, or LICENSE that indicates the owner, intended audience, and conditions for reuse.

There should also be an option to embargo some materials, e.g. where information required to set up the data pipeline might not be needed by pipeline users for a few years.

Private / Internal use

We define as private anything that has not been clearly marked for sharing. The owner / creator of information defines the audience.

Program-wide / Public / Shared

There is a subset of this information that will need to be made public before the data can be interpreted and reported in a scientific publication.

At this time, the standards committee (and others program wide) have requested

a physical representation of the LemnaTec system including where sensors are mounted
maps required to interpret experimental design
sensors specifications
calibration processes and standards, metadata, etc., will be important for interpretation

A short-term solution will be for a Lemnatec representative to identify and relocate information that can be shared publicly into a folder marked shared on box, such as https://uofi.box.com/terraref-box-shared. Internal documents can be kept where they are or placed in a subdirectory of a folder such as https://uofi.box.com/terraref-internal (e.g. a 'lemnatec' folder could be owned/adminstered by lemnatec)

Determine meta-data format for raw data from Lemnatec

Review and comment on and propose changes to filesystem and meta-data

Note below is a proposal, though worded as a declaration. Please indicate what needs to be changed, what could be changed, what is just right, and what really doesn't matter as long as it is clearly defined.

Summary:

use json with key/value pairs like key: value;
- ~~First proposal:: https://gist.github.com/dlebauer/75429d046badb100483b~~
- revised : https://gist.github.com/dlebauer/4a20f0d4512bbe2d1cdb
record any information that can vary or be modified on the sensor
- sensor temperature at time of measurement
- integration time
- gain
- offset
- time of day
- units
- diagnostics - any errors, if operating in appropriate ranges for sensor
- when was (and where is) last white reference
organization of sensor data:
- organize in folders as _ sensor / date / time_ (sensorname/YYYY-MM-DD/HH-MM-SStz/)
export 'fixed data' as a flat file with a time series, e.g. date / met / met.csv where the csv file will have a datetime field and time series of met observations.
- files will be transferred in discrete units (hourly, data) and transferred once complete (e.g. not continuously streamed)
meterological variables use "standard name" from climate forecasting conventions

Details

Meta-data

The current proposal is to use key-value pair json files for metadata, each data file and will be in a separate directory, alongside an info.json meta-data file. The meta-data will be comprehensive and redundant in order to be portable.

We will later extract meta-data from the metadata files into a searchable database (BETYdb).

Meta-data should contain anything that is not 'default' about the sensor, e.g. anything that can be set by the user (including position, optics, etc.). @robkooper suggested using url to point to some resources (Rob please clarify).

There has been some discussion of using json vs. xml. json is more flexible and preferred among many developers although XML formats can be more strictly defined using 'dtd'. Notably, the Postgres hstore extension provides a hstore data type and hstore_to_json and
hstore_to_jsonb functions (from "PostGIS in Action" 2nd
edition).

The scope of information that would be provided by an RGB camera is in this gist that contains the original proposal (info.xml, and other structures (info2.xml, info2.json)

Directory Structure

The directory structure will be nested by date then sensor, and divided into 'location independent' data like met:

/Date/Met/met.csv
/Sensorname/Date/Time/data.raw

Each raw datafile will be accompanied by a metadata file, e.g. info.json or info.xml metadata file.

This is based on a file system organization of all data. Examples on Box: https://gist.github.com/dlebauer/4a20f0d4512bbe2d1cdb

The folder structure can be accessed and is intuitive for users. The data within the folder structure is organized so that each sensor system stores an independent unit of data. This way, the different sensor systems stay independent from each other. The folder structure may be provided through an ftp server, a simple, robust, and widely used protocol.

Location Dependent (sensors)

definition: Sensors like cameras that fetches data only during an imaging job of the gantry system.

In the folder "locationDependent" you find a sub folder for each camera/sensor unit. The next sub folder structure divides the measurements by date (see folder "VisCamera"). For each measurement within a day an new folder is created that carries the timestamp. Within that folder you find a info.xml file containing all meta data and a data.XXX file that contains the raw binary data, as provided by the different sensors. The info.xml file is supposed to contain the full set of meta data required to interpret a specific measurement like camera type, camera settings, timestamp, location... (e.g. info.xml her. All meta data files will have the same layout. The binary data format depends on the different sensors/cameras and contains the data as provided by the sensor. In case of the Hyperspec cameras these files will be hypercubes, in case of the 3D scanner these files will be ".ply" point clouds.

Location Dependent (e.g. Meteorology)

definition: Sensors with a constant data flow, independent of the current location of the gantry, like wind or temperature sensors. These sensors will deliver data all the time, 24/7 in short intervals (or perhaps sync daily?).

The folder "locationIndependent" stores the constant stream of gantry location independent sensor data. It is divided in sub folders by date and then by timestamp. Each timestamp folder contains a single info.json file that contains the data and meta data of all sensors.

Converting lemnatec scanner metadata to absolute geographic positions

From @solmazhajmohammadi on March 18, 2016 20:31

@czender, At the moment there is no accurate GPS at the gantry box. It is planned to add geographical coordinates to metadata, based on the fixed point on the ground and a RTK GPS location.
Here are locations from each corner of the field that @rjstrand sent from his phone:

SE Corner 33° 04.470' N / -111° 58.485' W
SW Corner. 33° 04.474' N / -111° 58.505' W
NW Corner. 33° 04.592' N / -111° 58.505' W
NE Corner. 33° 04.591' N / -111° 58.487' W

I have used linear scaling formula to transfer the coordinates, it should be enough for the current GPS location.

To determine the location of each pixel within the image, Extrinsic and Intrinsic calibration matrices are needed.
3 coordinate systems exist:

Field coordinate frame (Xf , Yf , Zf)
Camera coordinate (Xc , Yc , Zc)
Image coordinate (Xp , Yp)

1 ) Extrinsic calibration matrix:
The extrinsic calibration parameters specify the transformation from the field to the camera coordinates, and it is represented as,
[Xc Yc Zc]'= Ro [Xf Yf Zf]'+ to
Since there is no rotation (Ro), MEX is only translating vector to. to is transition vector from the control point (0,0,0) to camera position [Xc Yc Zc]. Considering that gantry is moving with constant speed in x direction, and metadata information shows the starting time and start position of a scan.
to = [+xg+Vx*(t) +yg +zg]'

Where (xg ,yg ,zg ) is the camera position in gantry box. For Hyperspectral camera (xg, yg ,zg ) is (1.9 , 0.855 , 0.635). Vx is velocity in x direction. t represents time difference. ()' is transpose operation.

2 ) Intrinsic calibration matrix:
Notes:

The geometric distortions that exist between the 272 bands are considered to be small.
No rectification is done.
No distortion and lens model is considered.
Mirror angle change is not considered.

The intrinsic calibration parameters specify the transformation from the camera coordinate to the pixel coordinates.
Coordinate transformation between camera plane and image plane:
[Xp Yp f]'= A [ I 0 ] [Xc Yc Zc]'
Simple orientation matrix is:

αx = αy =α and it's focal length divided by pixel pitch. u0 and v0 denote the principal point (ideally center of image)
For SWIR camera focal length is 24 mm, and pixel pitch is equal to 25 um. I dont know how much is γ. I have contacted the Headwall group and hopefully soon, they will provide me with more information on calibration of the camera. So, I will update your response soon.

Copied from original issue: terraref/documentation#9

Mislabeled aperture in metadata

While the VNIR is uninstalled I have taken the opportunity to try out different aperture settings on the StereoVIS system while the RGB cameras are accessible. I have just discovered that the cameras have been physically set at f/16 while the metadata has been reporting them as set at f/4.

I have taken the liberty of changing the fixed metadata file so it will write correct values moving forward. This does mean that all past metadata files should be edited to be correct.

Our plans for settings on the cameras in the future are now much less limited. My initial "by eyeball" look at the output of the cameras, even at f/8 with gain and exposure both significantly reduced, is that image quality will be much improved. If the cameras were indeed intended to be set at f/4 and the DoF is not an issue I anticipate no further issue with noise from high gain, as well as greatly reduced if not eliminated motion blur.

Proposed format for meteorological variables exported from Lemnatec platform

PEcAn uses Climate Forecasting 'standard names' and 'canonical units' conventions (widely used in climate / met community) for meteorological and ecosystem-level mass and energy balance variables.

Here are some examples (note that we can change from canonical units to match the appropriate scale, e.g. "C" instead of "K"; time can use any base time and time step (e.g. hours since 2015-01-01 00:00:00 UTC, etc. But the time zone has to be UTC, where 12:00:00 is approx (+/- 15 min). solar noon at Greenwich.

Examples:

CF standard-name	units
time	days since 1700-01-01 00:00:00 UTC
air_temperature	K
air_pressure	Pa
mole_fraction_of_carbon_dioxide_in_air	mol/mol
moisture_content_of_soil_layer	kg m-2
soil_temperature	K
relative_humidity	%
specific_humidity	1
water_vapor_saturation_deficit	Pa
surface_downwelling_shortwave_flux_in_air	W m-2
surface_downwelling_photosynthetic_photon_flux_in_air	mol m-2 s-1
precipitation_flux	kg m-2 s-1
wind_speed	m/s
eastward_wind	m/s
northward_wind	m/s

standard_name is CF-convention standard names
units can be converted by udunits, so these can vary (e.g. the time denominator may change with time frequency of inputs)

How to calculate downwelling spectral radiances

Currently the spectral radiances are uncalibrated, and provided in the environmental logger as:

    "spectrometer": {
      "maxFixedIntensity": "16383",
      "integration time in ?s": "5000",
      "band": {
        "wavelength": 337.70483,
        "spectrum": 1500.0
      },
      "band": {
        "wavelength": 338.16013791719934,
        "spectrum": 1500.0
      },
      "band": {
        "wavelength": 338.61548740418232,
        "spectrum": 1503.0
      },
      "band": {
        "wavelength": 339.07087845402685,
        "spectrum": 1500.0
      }, ...

from
2016-04-13_00-38-15_environmentlogger.zip

Calibration files are in EnvironmentLogger/CalibrationData/
Calibrations.zip

The output of the spectrometer is 'raw' counts.

@TinoDornbusch in #26 you wrote

You need to use the attached calibration files to convert it to units of µW m-2 s-1. Careful you need to take the bandwidth of the chip into account (0.4nm) if you want to convert to µmol m-2 s-1.

Could you or @markus-radermacher-lemnatec please clarify, and provide an equation for converting the information in the calibration files to reflectances?

Upload 2016 first season field measurements to BETYdb

TERRA-Ref Maricopa Sorghum 2016 first season summary:

Planting date: April 19
Harvest dates: July 11-14
Crop duration: 82 days post planting

Summary of manually-collected field measurements (ground truth data collection):

Define expertise of current committeemembers

This will help understand who can help where, and what we need to fill in.

create a google form to let people check boxes (or select expertise levels 1-5) as well as provide free-form text responses
send to committee members
categories: remote sensing, image analysis, physiology, breeding, genomics, informatics, ...
levels: 0: never heard of it, 1: have heard of it but don't know, 2: have read the wikipedia page 3: could contribute to the wikipedia page (grad level) 4: use it as a tool actively in research 5: leader in field / expert in developing new methods

Over-exposed Hyperspec data

@czender Can you please take a look at the SWIR data from the following dates:
"time": "06/08/2016 10:57:09"
"Time": "05/06/2016 10:51:25"
It seems that all the bands are overexposed.
@TinoDornbusch any changes have been made on camera setting since last month?

Nominations for external members

We need a minimum of three additional external members for standards committee

~~[ ] create google form to request name, institution, contact info and expertise~~
request nominations from reference team
review at next teleconference

Feedback on initial draft of hyperspectral data file format

The first sample of hyperspectral image data is ready with units of "Exposure on scale from 0 to 2^16-1 = 65535". Data available as 134MB file foo.nc and much smaller metadata text file foo.cdl.

This has been prepared using the hyperspectral/terraref.sh script.

People will be most interested in looking at band-specific data so the default script is simplified for this.
Lossless compression by Deflate is 20-25% with no loss of data with less than 5 s to decompress.
Higher compression will cause longer decompression time
Lossy compression is an option (bit rounding), then compressing with lossless compression to get an extra 10% saving per decimal digit
In addition to netCDF format, there is interest in geoTIFF; gdal provides tools so that we can allow conversion to geoTIFF and other files for download on demand.

Next steps include calibration and conversion of exposure to reflectance as defined in #14.

Add level to calibration test object

For consistent interpretation of calibration panels, Larry Biehl suggests making sure that the panel is level.

Would it be feasible to attach a circular level similar to this one to the test objects?

Protocols for identifying IP, licensing, assigning copyright, conditions for reuse (for data, software)

Most of the data generated by the Cat 5 platform is intended for distribution and reuse with attribution (e.g. CC-BY or BSD-compatible license) or anything goes (CC-0). It is equally important to allow restrictions on access to and use of proprietary algorithms and data product.

What specific features are required? What solutions are available?

In particular:

What are technical specifications for access control? E.g.: users on a unix filesystem, database, web-interface, API.
What are the legal protocols for clearly specifying the conditions of use / reuse? All products and software be clearly identified with a copyright or software license. How should this be done? A LICENSE file in the root directory? Copyright stated in the metadata?
What types of NDA's or data sharing / use / attributions should do we need?

Why should a researcher release code / data / etc before publication?

because it is the TERRA REF team mandate - current plans are to release these to maximize reuse with attribution (e.g. MIT or CC-By). However, this could be done after some embargo, e.g. 6-12 months. And we aren't technically on the hook for public release until Nov 2018 (though for the most part, we are making data and code available as we produce and develop it).

So ... even if we make these resources public, they may not start with permissive licenses. In addition to acknowledgement, prior to planned open release we could require co-authorship. If we make the conditions of use and reuse clear, disobeying these conditions would be unethical, and possibly illegal. An academic violating such conditions would risk institutional or professional society discipline and publication retraction. Stealing IP for profit would be theft (though difficult to catch).

Create a list or catalog of data products

Creating a list or catalog of existing and planned data products for the TERRA project -- more than just the individual sensor metadata. It might be worth compiling a catalog similar to the NEON project (http://www.neonscience.org/sites/default/files/basic-page-files/NEON_DOC_002652.pdf). Based on the NEON data catalog, possible attributes of a data product description (e.g., catalog record) are listed in the "Data product catalog" section of the following Google Doc:

https://docs.google.com/document/d/13gXD_OVLffm0hqahDZ3tUvru8IV1fRfM6DiuOcfjr3s/edit?usp=sharing

The resulting list should be put in the documentation repository.

Sensor precision, and truncation

For each sensor, what is the instrument precision, and how does this translate to the numeric precision that we need to store.

Using floats with a fixed number of significant digits. Not optical sensor provides more than 5 significant digits, so single precision float is more than enough.

The origin of the coordinate system of pointclouds

@ZongyangLi @pless
Right now the origin of the coordinate system is set to the location of calibration object during the first calibration run.

We can change the origin in the configuration file and reprocess the data. (Requires Fraunhofer to change it)
Or we can find the exact location by some test scans and define a new origin and add a transformation matrix to the metadata.
We can go for the second option and @smarshall-bmr can do the scans to find the exact origin.

Document the parts numbers and specifications for calibration panels / objects?

We need to know the part numbers and specs for the calibration panels we are using. For example, SphereOptics says that for each panel

Calibration will be performed on a Perkin Elmer Lambda 19, data will be supplied electronically in 1 nm steps, 50 nm step printed documentation with NIST/PTB traceability with certificate for the range from 250 nm-2500 nm.

We need to collect this documentation. Please attach such documents to this issue.

It will also be useful to have pictures of each panel made with a standard RGB camera.

object	manufacturer	part number	comments
white reflectance panel	Spectralon	?
munsel color chart	?	?
3D object	?	custom	made of solid aluminum with an accuracy of 0.1mm and laser engraved checkboard pattern

are there any others?

Write protocols for field measurements

For field measurements, clearly specify how they should be taken so that they can be reproduced across sites.
Identify and review existing protocols
- Pérez-Harguindeguy et al 2013, covers plant traits in general
- genomes2fields standard operating proceedures
Solicit feedback from TERRA teams and standards committee
Move to terraref documentation

First draft is here: https://docs.google.com/document/d/1iP8b97kmOyPmETQI_aWbgV_1V6QiKYLblq1jIqXLJ84/edit

Example of a big-ish phenomics dataset, for feedback

A simulated dataset

cross posted on our website

The intent is to begin to get feedback on some rough sketches of what some data products might look like.

To this end, I have simulated the type of data that a sensor might observe, along with some of the underlying environmental drivers and physiological traits.

Note that there will be numerical artifacts, quasi-meaningful error terms, and liberal re-application of core concepts for the purposes of developing these datasets.

All of these simulated datasets are released CC-BY-0: do with as you please, but these are not production quality - just trying to meet demand and begin getting feedback.

A note on variable names

I have used the names currently used in BETYdb.org/variables, along with names inspired by the more standardized naming Climate Forecasting conventions. However, at this point this is a very early pre-release, and comments on how such data should be formatted and accessed can be discussed in issue #18.

Design of the Simulation Experiment

Overview

227 lines grown at each of three sites along a N-S transect in Illinois over five years (2021-2025). Two years were dry, two were wet, and one was average.

Years

These are historic data, but the years have been changed to emphasize the point that these are not real data.

year	drought index
2021	wet
2022	dry
2023	normal
2024	wet
2025	dry

Sites:

These are approximate locations used to query the meteorological and soil data used in the simulations.

site name	latitude	longitude
north	42.0	-88.5
central	40.0	-88.5
south	37.0	-88.5

Each site has four replicate fields: A, B, C, D. This simulated dataset assumes each field within a site has similar, but different meteorology (e.g., as if they were all in the same county).

Genotypes

Two-hundred and twenty-seven lines were grown at each site. They are identified uniquely by an integer in the range [9915:10141]

Phenotypes

The phenotypes associated with each genotype is in the file phenotypes.csv.

These 'phenotypes' are used as input parameters to the simulation model. We often refer to these as 'traits' (as opposed to biomass or growth rates, which are states and proceses). In this example, we assume that 'phenotypes' are time-invariant.

variable_id	name	standard_name	units	Description
	genotype			genetically and phenotypically distinct line
	Vmax		umol m-2 s-1	maximum carboxylation of Rubisco according to the Collatz model
38	cuticular_cond	conductance_of_fully_closed_stomata	umol H2O m-2 s-1	leaf conductance when stomata fully closed
15	SLA	specific_leaf_area	m2 kg-1	Specific Leaf Area
39	quantum_efficiency	mole_ratio_of_carbon_dioxide_to_irradiance_in_leaf	fraction	see Farqhuar model
18	LAI	leaf_area_index	m2 leaf m-2 ground	Leaf Area Index
31	c2n_leaf	mass_ratio_of_carbon_to_nitrogen_in_leaf	ratio	C:N ratio in leaves
493	growth_respiration_coefficient	respiration_coefficient_for_growth	mol CO2 / mol net assimilation	amount of CO2 released due to growth per unit net photosynthesis
7	leaf_respiration_rate_m2	respiration_rate_per_unit_area_in_leaf	umol CO2 m-2 s-1	Not really ""dark respiration"" Often this is respiration that occurs in the light. Date and time fields ""should"" identify pre-dawn (nightime/dark) leaf resp vs the Rd that comes from a A-Ci or A-PPFD curve
4	Vcmax	rubisco_carboxylation_rate_in_leaf_assuming_saturated_rubp	umol CO2 m-2 s-1	maximum rubisco carboxylation capacity
404	stomatal_slope.BB	stomatal_slope_parameter_assuming_ball_berry_model	ratio	slope parameter for Ball-Berry Model of stomatal conductance
5	Jmax	electron_transport_flux_in_thylakoid_assuming_saturated_light	umol photons m-2 s-1	maximum rate of electron transport
492	extinction_coefficient_diffuse	extinction_coefficient_for_diffuse_light_in_canopy		canopy extinction coefficient for diffuse light

Simulated Sensor Data

This dataset includes what a sensor might observe, daily for five years during the growing season.

note A sensor won't observe roots or rhizomes. Furthermore, Sorghum doesn't have rhizomes. The simulated biology is a little different.

variable_id	name	standard_name	units	Description
	sitename			Name of site
	plotid			experimental replicate plot
	year
	date		YYYY-MM-DD
	Stem	stem_biomass_content		Mg / ha
	Leaf	leaf_biomass_content		Mg / ha
	Root	root_biomass_content		Mg / ha
	Rhizome	rhizome_biomass_content		Mg / ha
18	LAI	leaf_area_index	ratio	Leaf Area Index is the ratio of leaf area to ground area
	NDVI	normalized_difference_vegetation_index	ratio	commonly used vegetation index
	Height	canopy_height	m

How to obtain data and give feedback:

Please provide feedback. Please provide feedback to [email protected], visit the TerraRef Reference Data chatroom or comment in our GitHub repository.

If you do something cool, please send comments and figures!

Data are located on Box: https://uofi.box.com/sorghum-simulation

observations.csv: simulated observations
phenotypes.csv: physiological traits associated w/ genotype (assumed time invariant)
met.csv: daily temperature and precipitation summaries

Updates and revisions to Lemnatec Field metadata

Here is a recent meta-data being generated by the Lemnatec Field system is here: https://gist.github.com/dlebauer/ccc1940fefbacaa60296

We will also need the following information:

the position of each sensor within the box
the position of the 0,0,0 point
additional time information
- for sensors like VNIR/SWIR that are operated in line-scan mode, presumably we need the start time as well as the end time (currently it is not clear if the time stamp is the time that the file was finished writing, or otherwise)
- for PSII sensor, we will need the time of the flash as well as the time that the fluorescence response was captured
Time and location of any calibration / correction data (e.g. dark and white reference, etc)
Calibration / correction data
- 3x3 K matrix for geometric correction / transformation from sensor view to reality
- Bandwidth, spectral resolution, FWHM, and / or spectral response
- Downwelling solar radiation.
- temperature (for FLIR)
- Field of view (bounding box on a plane at 2m below sensor bay, which we assume is top of canopy)

I'll note that any data that is stable in time (e.g. especially calibration data, but also position and structure of each camera) could be stored separately.

Anything else?

Document sensor calibration protocols

From 2016-06-03 meeting, @LTBen offered to draft sensor calibration protocols, and would do this with help from @smarshall-bmr , @JeffWhiteAZ, @TinoDornbusch

Here is the list of sensors on google drive

Draft of Calibration Protocols

For each sensor, define:

sensor name
test object(s) / calibration panels
method of calibration
required experiments to validate the quality of calibration protocol

Robinson and Biehl 1982 Calibration Procedures for Measurement of Reflectance Factor in Remote Sensing Field Research

Define Use Cases that end-user API to query data should support

define use cases (@dlebauer)
what features should an API have?
- query by genotype? query by location? time? sensor type? individual plants (entities in BETYdb).
- what features will make it easy to document and integrate with other API's?
- What is the value of an API Framework such as Swagger?
what framework does the current BETYdb API use? What features does it have? What would make it easier to use / extend?

Use cases

def: a few sentences that describe a task someone wants to accomplish.
will help prioritize feature development and project organization, and get feedback.

ex1: someone is looking at an image in Clowder, have identified a particular trait, and want to find all plants with this trait (within some range or greater than, e.g. top 10% biomass) and the find other data assoc. w/ these plants

ex2: I have an interesting thing I’ve noticed, can I find all plants w/ same feature +/- X%

ex3: Want to upload data so someone else can get to it and its metadata

ex4: want to publish a collection from Clowder

references

Develop talking points for project

Ask Nick for Claire's time
Compile information for Claire
Schedule time to meet with Claire
Claire finishes draft 1
Feedback from @dlebauer on Draft 1
Claire finishes draft 2
Feedback from @dlebauer on Draft 2

Virtual Standards Committee Meeting

Determine date (week of Oct 19 doodle poll: http://doodle.com/poll/76rs4y9qv78kbc79#table)
Determine location
Invite Committee Members
Determine scope and content
Develop agenda

Formats for traits, phenotypes, and agronomic managements

Define traits that will be used in GWAS study, including variable names, units, and methods of measurement / validation.

for example:

name	units	description	method of measurement
SLA	m2 kg-1	Specific leaf area leaf area per unit mass	hole punch

Ultimately, these should be entered into the trait database which will look like this: https://www.betydb.org/variables/15.

what is the core set of traits that teams are interested in measuring for calibrating algorithms to derive traits from sensors, predict yields and crop fitness, and use in genomic analysis? (@JeffWhiteAZ)
- a few easy ones include specific leaf area, height, leaf area index, leaf number
- what other simple sensor-derived values, such as mean greenness by plot or plant, NDVI, and other indices
Is there a set of trait names, canonical units, and definitions (similar to CF Standard Names that we can use for these e.g. specifically what does ICASA provide? (@chporter)
- ans: We will support ICASA, CF, augmented as needed
- add thesaurus table / use BETYdb standard_name, canonical units fields by default (newly added mostly empty fields in variables table) https://www.betydb.org/schemas?partial=variables_table
How can we distinguish among sensor-derived and field (e.g. calibration / validation) measurements?
- ans. BETYdb allows each trait record to be linked to a method defined in a separate table
  this method of measurement, e.g. using a hole-punch vs. whole-leaf measurement of a leaf mass / area could affect the magnitude of the measurement, but at the plant scale is interpreted as the same trait. With calibrating precise sub-leaf measurements, we can subset data by method of measurement without distinguishing the variable names.
- in addition, when different models are used to estimate a parameter, and the parameter is only relevant for a specific model formulation (e.g. Ball-Berry, Leuning, and Medlyn variants of stomatal slope used to estimate leaf-level gas exchange), we use different records in the variables table, because the values are clearly defined as distinct.
- this is consistent with PEcAn (pecan project.org) RTM inversion

Managing Synonyms:

Tool for managing synonyms integrated w/ Clowder:

Knowledge information systems http://ecgs.ncsa.illinois.edu/KIS.html

Other vocabularies:

CGIAR sorghum ontology:
- http://www.cropontology.org/ontology/CO_324/Sorghum%20ontology
- http://www.cropontology.org/ontology/CO_715/Crop%20Research
CF Standard name XML table http://cfconventions.org/Data/cf-standard-names/33/src/cf-standard-name-table.xml
BETYdb variables https://www.betydb.org/variables.json
- see also spreadsheet with CF-style names mapped to BETYdb variables table
ICASA Master Variable list: https://docs.google.com/spreadsheets/d/1MYx1ukUsCAM1pcixbVQSu49NU-LfXg-Dtt-ncLBzGAM/pub?output=html

Add example data formatted in ICASA standards

@JeffWhiteAZ has data from the Hot Cereal Cereal experiment and cotton yield trials that comply with ICASA variable naming conventions, stored in both JSON and spreadsheet format.

add both spreadsheet and json formatted data to the Box folder (https://uofi.box.com/terraref); we can use this as a starting point for discussions of data and metadata formats.

Give Roger access to beta testers

Noah Fahlgren, Rob Pless at WUSTL and Charlie Zender

Include sample data

Private owncloud server?

Quick inventory of data available

Just had a query for wheat data. Do we know if any usable data were obtained from the Scanalyzer?
Is Clowder the best place to monitor data availability? I'd guess I'd like a data overview that tells the status of anything in the pipeline - raw, initial QC ... final product

Expand mailing list

add users
request that users encourage their postdocs and graduate students to sign up

Write statement of work descriptions for all of the contributing groups

replaced by terraref/computing-pipeline#10

Write a document that includes definitions of

specific tasks and milestones for each group at NCSA
contributions expected by other groups within the program (DDPSC, MAC, KSU, WUSTL, Hudson Alpha)

SWIR out for service

Friday 8/26 the SWIR was uninstalled because its sensor thermoregulation system is failing to maintain a constant temperature. We are currently waiting for an RMA tag from Headwall before we ship it out for repairs. LemnaTec is heading this process but I'll act as a liaison and keep everyone updated as the RMA process continues. If the RMA on the VNIR is anything to go by then we will be without the SWIR for at least a month.

Metadata for new Weather Station

What sensors are on board?
How are they calibrated?
What do column names mean?

Proposed format for hyperspectral and other imaging data

This is a proposal for spectral and imaging data to be provided as HDF-5 / NetCDF-4 data cubes for computing and downloading by end users.

Following CF naming conventions [1], these would be in a netcdf-4 compatible / well behaved hdf format. Also see [2] for example formats by NOAA

Questions to address:

what is the scope of data products can be produced in this format?
what meta-data is required?
what tools are available for converting to and from this format?
what are other options, advantages / disadvantages?

Radiance data

Variables

variable name	units	dim 1	dim 2	dim 3	dim 4	dim 5
surface_bidirectional_reflectance	0-1	lat	lon	time	radiation_wavelength
bandwidth	0-1	lat	lon	time	radiation_wavelength
upwelling_spectral_radiance_in_air	W m-2 m-1 sr-1	lat	lon	time	radiation_wavelength	zenith_angle

note: upwelling_spectral_radiance_in_air may only be an intermediate product (and perhaps isn't exported from some sensors?) so the focus is really on the reflectance as a Level 2 product.

Dimensions

dimension	units	notes
time	hours since 2016-01-01	first dimension
latitude	degrees_north	(or alt. projection_y_coordinate)
longitude	degrees_east	(or alt. porjection_x_coordinate below)
projection_x_coordinate	m	can be mapped to lat/lon with grid_mapping attribute
projection_y_coordinate	m	can be mapped to lat/lon with grid_mapping attribute
radiation_wavelength	m
zenith_angle	degrees
optional
sensor_zenith_angle	degrees
platform_zenith_angle	degrees

[1] http://cfconventions.org/Data/cf-standard-names/29/build/cf-standard-name-table.html
[2] http://www.nodc.noaa.gov/data/formats/netcdf/v1.1/

Create a metadata file for fixed sensor / field scanner meta-data

Description

I propose that we keep a separate file containing 'fixed sensor metadata'. It should have the same fields / schema that are currently in the *_metadata.json files as well as additional information.

We should not change the current _metadata.json files that are generated, just have a new canonical source for the information.

Context

There have been a number of times when the 'fixed' information has been changed or updated. This will make it easier to update meta-data without having to rewrite the original metadata files that had incorrect information.

Here are some of the issues that such a static (or infrequently updated) file would help:

incorrect camera location #40
mislabeled aperature in meta-data #39
changes to metadata #25
incorrect field of view terraref/computing-pipeline#126

Further Suggestions / Request for Feedback

What additional fields should it contain?
- date range for data that it applies to
- who is responsible for writing it
- location of sensor calibration files
- others?
How should it be stored?
- start with .json files named sensor_metadata.json
- put in the gantry data stream near calibration files
- made accessible for programmatic access via an API (clowder?)

Define data products, repositories, publication venues

We will be producing diverse data products, protocols, and software (hereafter ‘products’). These are intended for distribution and reuse with attribution Products will be continuously revised, updated, and expanded. New versions released annually, with opportunity for community feedback. Other teams will potentially contribute protocols, similar types of data. Data sets from different teams could be merged for analysis (ie more than one attribution)

Key Products:

raw data and data products
- fifteen + sensors,
- trait data: direct measurements and sensor derived
- genomics
software
- pipeline components like Clowder
- databases (schemas)
- interfaces
protocols
- field measurements
- sensor calibration

Approach

Define products defined by 'level' of processing (sensu NASA)
Clearly mark data with conditions of reuse (CC-By by default) and how to cite
Versioning of products by year of release
- each year a new version of data products will be released
- new version will have increased number of data products, some data products will be re-analyzed
- define what data can be 'retired', based on information content or reproducible pipeline from raw data
Data repositories for different data products
- Illinois data bank (UI Library) for large sensor data
- Genomics data: phytozome?
- Others? BMS hosted by Cyverse.
Submit letter of inquiry to Nature Scientific Data or similar venue regarding one or more data papers to define scope of data.

See also meeting notes: https://goo.gl/QhpwcH

~~TODO~~

Camera Position in metadata

Recorded camera location in camera box for some of the sensors (including SWIR) is not correct. Stuart is going to fix it.

Changes to data stream from environmental sensors

The first environmental data samples (e.g. 2016-02-15_21-20-08_enviromentlogger.json.txt) are in a json key:value format.

I propose the following changes:

write one file per hour (rather than every 2 minutes)
use variable names and units defined in #3 to avoid confusion
write all variables except downwelling spectrum into a table (csv or netcdf) with a time stamp or dimension
- add ambient CO2 from the moving sensor to this file.
- co-locate CO2 sensor with other met sensors (or is there a reason to have it on the bay?)
write a separate file to contain the downwelling spectral radiance, which should be a file similar to format in #14 (but lacking the x and y dimensions)

Note that currently the file has variables spectrum and wavelength but nothing measuring irradiance

ensure that files are valid json (see below)
Please restrict the text meta-data files to ASCII. Some '?'s appear that seem to have been μ "umol" or "micromol" would work.

@markus-radermacher-lemnatec

Images from full grown plants

In order to validate the algorithm for image alignment and stitching, I need images from full-grown plants.

Review existing standards, conventions, and ontologies. Which should we use, adopt, support, learn from?

Our goal is to create data products that are easy to access and use.

There are a few classes of data:

Crop physiological traits (#18)
Agronomic meta-data (#18)
Sensor Output (#14)
Derived metrics from image analysis (spectral indices like NDVI, geometric statistics like height, convex hull)
Genomics
(others? is this the correct list?)

For each class of data:

what are the key existing standards / conventions?
what is on the list or in the wild that we should not adopt
what are general principles and where is minimum meta-data information content defined?
what are the widely accepted 'unique identifiers' that we clearly should adopt ?
- (e.g. UTC for time, lat lon for location, Binomial classification (Linnean Latin names) for species)?
what are the use cases (queries that we should support)?

Each data format should have brief description, focusing on

features relevant to TERRA
existing tools and applications
ease and value of integration / adoption / adherance
general recommendation
TODO
- Update list of standards in documentation
- discuss in comments (below)
- summarize information and contribute or add to comment below

Notes

This is a proposal open for comments and contributions. We plan to update these specifications annually, starting with v0 in Nov. 2016

Upload plot center coordinates to clowder

From @LTBen: The researchers marked the measurement “center” of each range (plot). Stuart and Tino collected the gantry coordinates of these positions

@smarshall-bmr and @TinoDornbusch could you please send these coordinates to me or upload them to google drive?

Establish daily data transfer from Rothamstead.

@yanliu-chn

Import MB-GB - scale sensor data daily from Rothamstead.
Import raw data files and a meta-data database.

What method of transfer? (ftp? ssh? other?)
What server can we put data on now? Ideally scalable, or we can re-assess location at 1 TB.
- setup daily automated import (rsync + chron) of files from Rothamstead
setup Postgres server; start with daily dump / import of Rothamstead Lemnatec databases.
- get server and database access from Rothamstead (who? how? or have them dump?)
run ubuntu NDS VMs or docker instances that can access this data
make ftp address public

National Data Service:

VMs / Dockers that can connect to above filesystem (e.g. mount at /home/data/) and connect to database
- VMs or Dockers have read access to filesystem + Postgres server, and somewhere to write / dump data

terraref / reference-data Goto Github PK

reference-data's People

Contributors

Stargazers

Watchers

Forkers

reference-data's Issues

Overview of Genomics Pipeline

Details

Raw reads + quality scores

Reference genome assembly

Sequence alignment

de novo sequence alignment

SNP and genotype variants

Genomic coordinates

Genome annotations

Visualizing and annotating Genomes

Downstream

Repository

Private / Internal use

Program-wide / Public / Shared

Summary:

Details

Meta-data

Directory Structure

Location Dependent (sensors)

Location Dependent (e.g. Meteorology)

A simulated dataset

A note on variable names

Design of the Simulation Experiment

Overview

Years

Sites:

Genotypes

Phenotypes

Simulated Sensor Data

How to obtain data and give feedback:

Use cases

references

Managing Synonyms:

Radiance data

Variables

Dimensions

Description

Context

Further Suggestions / Request for Feedback

Key Products:

Approach

Notes

Recommend Projects

Recommend Topics

Recommend Org