
TripleD

An R-package to read and transform TripleD sample data into a database with presence-absence, density, and biomass data for benthic megafauna. The TripleD is a special quantitative sampling dredge produced and used by the NIOZ Royal Netherlands Institute for Sea Research.

This package contains all the code necessary to set up the TripleD database with time-series data collected by NIOZ. This package does not contain data; data can be requested from the NIOZ Data Archiving System (DAS).

The database output of the package can be explored interactively with a dedicated Shiny app.

Installation

You can install the TripleD R-package, including the HTML vignette, by running the following commands (note that you need the R-package devtools installed):

#install.packages("devtools")
devtools::install_github("dswdejonge/TripleD", build_vignettes = TRUE)

# Read vignette:
browseVignettes("TripleD") # click HTML

Constructing the database

Set up your working directory as follows to construct the NIOZ TripleD database:

  1. Go to the NIOZ Data Archiving System (DAS) and request the formatted TripleD data CSV files.
  2. In your working directory create a folder called 'inputfiles'. Within this folder you have to create two other folders called 'Species' and 'Stations'.
  3. Put the CSVs with all species data in the folder 'inputfiles/Species' and the CSVs with all station data in the folder 'inputfiles/Stations' (do not put any files in these folders that are not CSV data files formatted for the TripleD database).
  4. Add the following R script to your working directory and run it:
# Load the library
library(TripleD)

# Loads all CSVs, checks format, and stores an R dataframe 
# in the newly created folder 'data'.
# Should not throw errors if the CSVs taken directly from DAS are used.
construct_database(in_folder = "inputfiles")

# Load the bioconversion CSV and check the input format.
check_bioconversion_input()

# Collect bathymetry from NOAA and taxonomy from WoRMS
collect_from_NOAA()
collect_from_WORMS()

# Prepare the bioconversion file to use (add valid taxon
# names and calculate mean conversions for each higher taxon)
prepare_bioconversion()

# Add extra data to the initial database (taxonomy, water depths
# from bathymetry, track lengths from coordinates and ticks,
# bearings, and ash-free dry weight using conversion data)
complete_database()

# Finalize database, by aggregating data, selecting relevant columns, and
# calculating final densities and biomass per sampling station.
finalize_database()

# There is a database with density and biomass data 
# per taxon per station.
load("database.rda")

# View definition of each database column.
att_database

# Extract a community matrix for ecological analysis,
# e.g. Ash-Free Dry Weight per m2 for all species.
CM <- get_community_matrix(database, "species", "Biomass_g_per_m2")
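
# Possible follow-up analyses (a sketch only; it assumes CM is a numeric
# matrix with stations as rows and taxa as columns, which may differ):
total_biomass_per_station <- rowSums(CM, na.rm = TRUE)
taxon_richness_per_station <- rowSums(CM > 0, na.rm = TRUE)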

# There is also a database with individual size measurements 
# and weights per taxon.
load("database_individuals.rda")

# View definition of each database_individuals column.
att_database_individuals

Cheat sheet

If you feel lost in the workflow and all the files, consult the cheat sheet.

Size dimensions

A quick reference and reminder of which size dimensions are measured for the different morphological groups. The names in the diagram correspond exactly to the names in the species CSVs and the bioconversion CSV that can be requested from the NIOZ Data Archiving System (DAS).

[Size dimensions diagram]


Issues

Consistency in booleans.

Booleans are currently coded inconsistently: sometimes as 1, 0, and NA, and sometimes as only 1 and NA. Should code be added to transform all booleans (1/0/NA) to TRUE/FALSE?
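
A minimal sketch of such a transformation, assuming the boolean columns are stored as numeric vectors; the data frame and column names below are hypothetical:

# 1 -> TRUE, 0 -> FALSE, NA stays NA (names are hypothetical)
species$is_Partial <- as.logical(species$is_Partial)

# For a column coded only as 1/NA, NA would have to be read as FALSE
species$is_Juvenile <- !is.na(species$is_Juvenile)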

Extra station info

Add extra station info from external sources:

  • Environmental data (temperature, nutrients, etc.)
  • Add functional traits to species ('Genus traits handbook'?, Imares?)

If weight is partial, use size measurement for biomass.

If a weight is reported but is_Partial is true, use the weight estimated from the size measurement via a regression instead. If there is also no regression-based weight, use the partial weight and report the underestimation.
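
A minimal sketch of this rule for a single individual, assuming columns such as a measured wet weight, an is_Partial flag, and a regression-based weight estimate (all names are hypothetical):

estimate_weight <- function(WW_g, is_Partial, regression_weight) {
  # Hypothetical helper illustrating the proposed rule.
  if (!is.na(WW_g) && !isTRUE(is_Partial)) {
    return(WW_g)               # complete measured weight
  }
  if (!is.na(regression_weight)) {
    return(regression_weight)  # weight estimated from the size measurement
  }
  warning("Only a partial weight is available: biomass is underestimated.")
  WW_g                         # fall back on the partial weight
}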

Write test for bioconversion format.

Test with a bioconversion file containing deliberate mistakes (a sketch of such tests follows the list):

  • WW_to_AFDW is not a fraction.
  • Only WW_to_AFDW or only a regression is given.
  • WW_to_AFDW has different values for the same species.
  • The taxon name is wrong: does it give a message and a list of the wrong names?
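
A sketch of such tests using testthat; the file argument and the test file names are assumptions, since the actual signature of check_bioconversion_input() may differ:

library(testthat)
library(TripleD)

test_that("faulty bioconversion input is rejected", {
  # Each file below is hypothetical and contains one deliberate mistake.
  expect_error(check_bioconversion_input(file = "bioconversion_bad_fraction.csv"))
  expect_error(check_bioconversion_input(file = "bioconversion_only_WW_to_AFDW.csv"))
  expect_error(check_bioconversion_input(file = "bioconversion_conflicting_values.csv"))
  # A wrong taxon name should at least give a message listing the wrong names.
  expect_message(check_bioconversion_input(file = "bioconversion_wrong_taxon.csv"))
})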

Fix error message.

The message currently reads:

In file Stations/test_stations_solved.csv, the entries in row(s) 5 are defined 'Focus' in the column 'Cruise_objective', but no excluded taxons are given in the extra column 'Focus'

It should read:

In file Stations/test_stations_solved.csv, the entries in row(s) 5 are defined 'Focus' in the column 'Cruise_objective', but no taxons that were focussed on are given in the extra column 'Focus'

test: error when incomplete, but no excluded column

If a station objective is 'Incomplete' or 'Focus' but there is no information in the 'Excluded' or 'Focus' column, test that an error is thrown. Is it possible to have both an 'Excluded' and a 'Focus' column? Should these be merged into a single column? A sketch of such a test is given below.
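
A sketch of such a test; the input folder name and the expected error text are assumptions:

test_that("Incomplete/Focus station without Excluded or Focus info throws an error", {
  expect_error(
    construct_database(in_folder = "inputfiles_missing_focus"),
    regexp = "Focus"
  )
})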

Sum of count and biomass, na.rm = T

Sum up anyway, but keep track of whether the data are complete or incomplete. For example:
COUNT: 5 + 6 = 11 (complete), but 5 + 6 + NA = at least 12 (incomplete: each NA represents at least one individual).
BIOMASS: 5 + 6 = 11 (complete), but 5 + 6 + NA = 11 (incomplete: it is not possible to estimate a minimum weight to add).
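
A minimal base-R sketch of this bookkeeping for a single station, using the values from the example above:

counts  <- c(5, 6, NA)
biomass <- c(5, 6, NA)

count_total    <- sum(counts, na.rm = TRUE)         # 11
count_minimum  <- count_total + sum(is.na(counts))  # 12: each NA is at least one individual
count_complete <- !anyNA(counts)                    # FALSE

biomass_total    <- sum(biomass, na.rm = TRUE)      # 11: no minimum weight can be added for the NA
biomass_complete <- !anyNA(biomass)                 # FALSE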

count * frac is not always integer: round

Sometimes upscaling by the fraction does not result in an integer because the fraction is not a neat value. Round to the nearest integer before using the upscaled count in calculations.
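
A minimal sketch following the issue's notation; count and frac are purely illustrative objects:

count <- 7
frac  <- 1.3                           # illustrative upscaling factor
upscaled_count <- round(count * frac)  # 9: rounded before any further calculations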

Make nice map with abundances.

Current issues:

  • The map is small.
  • The map takes long to render.

Additionally, an interactive map would be nice (click on a data point to get its metadata).

Where/how to store description of how CSVs were created.

Creating the CSVs is a manual process. It should be documented how the raw data are turned into the required fields of the TripleD database, as specified in the attribute files. However, how and where should this information be stored? It can serve multiple purposes: either to retrace how the data came to be, or as an example when collecting and cleaning your own data.

Thoughts:

  • A vignette with all headers as specified in the attributes file, in which people can freely describe how they altered their dataset for use in the database. Pros: all information in one location, included in the package. Cons: difficult to review the changes per CSV file, the file format (R Markdown) might not be very future-proof, and it may be a barrier for some users (hidden in the package, requires RStudio to edit).
  • A CSV or text file with the same name as the stations or species CSV, with a suffix such as "_metadata". Pros: can be stored together with the CSV in DAS, easy to create and edit. Cons: not all information is in one place (if a column in the database looks odd, the assumptions behind each CSV cannot be reviewed in a single location).

Add confidence measure?

Perhaps it is possible to add a measure of confidence?

Confidence is lower when (a simple scoring sketch follows the list):

  • the fraction is assumed,
  • the reported weight is partial,
  • the data are calculated instead of measured (following the order of preference used when combining data sources).
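
A minimal sketch of one possible scoring scheme; the penalty weights and the flag names are arbitrary assumptions, purely illustrative:

confidence <- 1 -
  0.2 * is_fraction_assumed -  # fraction was assumed rather than recorded
  0.2 * is_partial_weight -    # reported weight is partial
  0.2 * is_calculated          # value was calculated instead of measured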
