arms-mbon / data_workspace

This repo is the working area for ARMS-MBON event and omics data: data are collected, harvested, combined, reformatted

License: MIT License

Python 100.00%

dna ro-crate biodiversity-data biodiversity-observation-network

data_workspace's Issues

QC on the PlutoF download: station and arms unit names

Things to improve in the PlutoF download as currently coded (July 27):

I had sent you a spreadsheet with the corrected names of the stations, please include those corrections: for example Swedish_West_coast should be "SWC". The hard-coded syntax corrections you have in that code are otherwise OK and should remain as such, only the station names need to be taken from this spreadsheet
Please add an error check/alert to your download: if a station is downloaded that is not in my spreadsheet, inform me and I will update the spreadsheet. Maybe we should put that correction spreadsheet somewhere in GitHub (rather than attached to this issue)?
In addition to the other hard-coded syntax fixes you implement in your code, please add a "remove spaces" step: the Gulf of Piran ARMS units called Luka KP and Boja Vida, for example, need to be changed to LukaKP and BojaVida
I think your code is, however, too harsh in that the station called Greenland now has ARMS units called 1 2 3 1 2 3, rather than Nuuk1/2/3 and Danebork1/2/3. Please fix this, but in a way that is future-proof, not just for this next download
The same comment for the station called Laeso

If you think we need to discuss a better way to manage this QC of the station and arms unit names, probably better to do so before Aug 12. In the end it may be better to use my input to correct these names, rather than hard-coding, but with alerts for when something new is encountered that I have not yet provided input for

The spreadsheet with the corrections to apply to the station and arms names:
PlutoF_QC_v2_StationARMSnames.csv
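
A minimal sketch of how the requests above could be handled: station names are looked up in the correction spreadsheet, unknown stations raise an alert instead of being silently passed through, and spaces are removed from ARMS unit names. The column titles "Station" and "Station corrected" and the function names are assumptions for illustration, not the names used in the actual script or spreadsheet.

```python
import csv
import io


def load_station_corrections(csv_text):
    """Map each station name as found in PlutoF to its corrected name."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return {row["Station"]: row["Station corrected"] for row in reader}


def correct_station(name, corrections, alerts):
    """Return the corrected name; record an alert if the station is unknown."""
    if name in corrections:
        return corrections[name]
    alerts.append(f"Station not in correction spreadsheet: {name}")
    return name


def remove_spaces(unit_name):
    """Drop all whitespace from an ARMS unit name, e.g. 'Luka KP' -> 'LukaKP'."""
    return "".join(unit_name.split())


# Hypothetical spreadsheet content, only for this example
sample_csv = "Station,Station corrected\nSwedish_West_coast,SWC\n"
corrections = load_station_corrections(sample_csv)
alerts = []
print(correct_station("Swedish_West_coast", corrections, alerts))
print(remove_spaces("Luka KP"))
print(correct_station("UnknownStation", corrections, alerts))
print(alerts)
```

Collecting the alerts in a list, rather than stopping at the first unknown station, lets the download finish and still report everything that needs adding to the spreadsheet.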

Error catches to add to the PlutoF download and QC script

It is necessary to catch the following cases in the script where PlutoF metadata are downloaded and subjected to a QC on the station and ARMS unit names

The input to this QC is the CSV file with the name PlutoF_QC_StationsARMSnames.csv and it will be updated by (usually) Katrina, using git history to keep track of changes that will be made to it.
This file has 6 columns: Station, Stations corrected, Country, Country corrected, ARMS unit, ARMS unit corrected. The QC script will change every Station, ARMS unit, and Country that it encounters in the PlutoF download to the "corrected" values.
If there is a different number of station+ARMS units in PlutoF than in this spreadsheet (because more have been added to PlutoF since the spreadsheet was made) then the QC may not be done fully.
To avoid this the script should do the following:

check that all the "stations" downloaded from PlutoF are in the spreadsheet, and that the name in the Stations column matches the name in PlutoF exactly
for each station, check that all the "ARMS units" downloaded from PlutoF are in the spreadsheet, and that the name in the ARMS unit column matches the name in PlutoF exactly
produce a QC report, called "PlutoF_HarvestQCreport.csv" which can be a copy of the contents of "PlutoF_QC_StationsARMSnames.csv" with additional columns
Station ; Station corrected; Station QC; Country; Country corrected; Country QC; ARMS unit; ARMS unit corrected; ARMS unit QC

If the station and/or country and/or arms unit matches in PlutoF with the input from PlutoF_QC_StationsARMSnames.csv: the entry in the respective QC column of PlutoF_HarvestQCreport.csv is "passed"
If an entry in PlutoF_QC_StationsARMSnames.csv is not found in PlutoF: the entry in the respective QC column is "not passed"
If there is an entry in PlutoF (a station, or a country, or a unit) that is not in PlutoF_QC_StationsARMSnames.csv, then add a new row to PlutoF_HarvestQCreport.csv with the appropriate value of the "station" "country" and "ARMS unit" added, and with the QC column having the word "new" in there for the part that is new (be that station and/or country and/or unit)

I see that now the Belgian and one other station have arms units that I do not have in the current QC input file, so if you test it out now there should be some "new"s in the QC report
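
The three QC outcomes described above ("passed", "not passed", "new") can be sketched for a single column as follows. This is only an illustration of the comparison logic on plain lists of names; the function name is hypothetical and writing the result to PlutoF_HarvestQCreport.csv is left out.

```python
def qc_column(spreadsheet_names, plutof_names):
    """Return {name: status} comparing spreadsheet entries with PlutoF entries.

    'passed'     : in the spreadsheet and found in PlutoF (exact match)
    'not passed' : in the spreadsheet but not found in PlutoF
    'new'        : found in PlutoF but missing from the spreadsheet
    """
    known = set(spreadsheet_names)
    found = set(plutof_names)
    status = {}
    for name in spreadsheet_names:
        status[name] = "passed" if name in found else "not passed"
    for name in plutof_names:
        if name not in known:
            status[name] = "new"
    return status


# Made-up example values
print(qc_column(["Koster", "Laeso"], ["Koster", "BelgiumCoast"]))
```

The comparison is exact, including case and spaces, which matches the requirement that names agree with PlutoF exactly.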

modifications to the combined CSVs

For the files that you create from the combined PlutoF and GS data, can you make the following changes?
1- for the file called https://github.com/arms-mbon/data_workspace/blob/main/qualitycontrolled_data/combined/combined_ImageData.csv, can you please only put in there the rows that have "filetype" Image.
2- then, for the file https://github.com/arms-mbon/data_workspace/blob/main/qualitycontrolled_data/combined/combined_SamplingEventData.csv, the column currently called "Number of associated data files" should be changed to "Number of images" and should report the ...number of images
3- but then for the remaining files that are currently in https://github.com/arms-mbon/data_workspace/blob/main/qualitycontrolled_data/combined/combined_ImageData.csv, that are NOT of filetype "image", can you move those to a file called https://github.com/arms-mbon/data_workspace/blob/main/qualitycontrolled_data/combined/combined_OtherDataFiles.csv
4- For the file https://github.com/arms-mbon/data_workspace/blob/main/qualitycontrolled_data/combined/combined_SamplingEventData.csv, can you change the column "Number of ENA sequences" to "Sequences available" and, rather than reporting the number, report the unique list of gene types: so for an event, if there is an ERR number in Gene_COI, Gene_ITS, and Gene_18S, then the list is "COI; ITS; 18S", and accordingly for other combinations of what is available.
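
As a rough sketch of points 1, 3 and 4, assuming each CSV row has been read into a dictionary: the first function splits rows on the "filetype" column, the second builds the "Sequences available" value from the Gene_COI, Gene_ITS and Gene_18S columns. Matching "image" case-insensitively and treating any value that starts with "ERR" as an available sequence are assumptions.

```python
GENE_COLUMNS = ["Gene_COI", "Gene_ITS", "Gene_18S"]


def split_by_filetype(rows):
    """Separate image rows from all other associated data files."""
    images, others = [], []
    for row in rows:
        if row.get("filetype", "").strip().lower() == "image":
            images.append(row)
        else:
            others.append(row)
    return images, others


def sequences_available(row):
    """List the gene types that have an ERR accession, e.g. 'COI; 18S'."""
    genes = []
    for column in GENE_COLUMNS:
        if row.get(column, "").strip().startswith("ERR"):
            genes.append(column[len("Gene_"):])
    return "; ".join(genes)


# Made-up rows for illustration
rows = [{"filetype": "Image"}, {"filetype": "Spreadsheet"}]
images, others = split_by_filetype(rows)
print(len(images), len(others))
print(sequences_available({"Gene_COI": "ERR123", "Gene_ITS": "", "Gene_18S": "ERR9"}))
```

The number of images for point 2 would then simply be the length of the image list per event.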

Check-points to code into the PlutoF download

Some checkpoints on the PlutoF download and QC script are necessary. The script can continue working, but it should produce a report that is sent to me (can this be done automatically?)

The report should include

Date
Which checkpoint this report is about (I have given the checkpoints names)
Whether the script could execute completely or not

This is my suggested initial list of checkpoints

"Checkpoint: station names". Are there stations or ARMS units in the new PlutoF download that are not in the QC spreadsheet that I provide (the one with the corrections to those names)? Report which PlutoF station is missing
"Checkpoint: number of stations pages". Are there fewer downloaded pages than in the previous download? Do this station by station, i.e. the report should include "Station XXX has ## fewer plutof pages than the previous download". This is so that I can see that something has been deleted, which normally is a big no-no.
Sub-checkpoint "Checkpoint: missing station pages": if you can do this, then tell me which types of pages are missing for that station, e.g. material samples, sequences, associated data, sites, subsites (arms units), and/or events
"Checkpoint: arms units renamed". For each station, are the names of the arms units (subsites) different from before? This is a nice2have, not a must have: I ask it because some @#$@#$ made changes without telling me, and for us this is a big problem that I really do need to know about

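One possible shape for the "number of stations pages" checkpoint and its report, assuming the page counts per station from the previous and current downloads are available as dictionaries. The function names and report keys are hypothetical; sending the report by email is not covered here.

```python
from datetime import date


def fewer_pages_messages(previous_counts, current_counts):
    """One message per station that has fewer pages than in the previous download."""
    messages = []
    for station, before in sorted(previous_counts.items()):
        now = current_counts.get(station, 0)  # a missing station counts as 0 pages
        if now < before:
            messages.append(
                f"Station {station} has {before - now} fewer plutof pages "
                "than the previous download"
            )
    return messages


def checkpoint_report(checkpoint, messages, completed, today=None):
    """Bundle the date, checkpoint name, completion flag and messages."""
    return {
        "Date": (today or date.today()).isoformat(),
        "Checkpoint": checkpoint,
        "Script completed": completed,
        "Messages": messages,
    }


# Made-up page counts
messages = fewer_pages_messages({"Koster": 10, "Laeso": 5}, {"Koster": 8, "Laeso": 5})
print(checkpoint_report("number of stations pages", messages, True))
```
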
Metadata.csv files for the combined data

combined_metadatasheets.xlsx

For the files in https://github.com/arms-mbon/data_workspace/tree/main/qualitycontrolled_data/combined, each xxxxData.csv file now needs a xxxxData_Metadata.csv file
In the attached I have explained what goes in these metadata files for the 5 files in this folder in GH (note: 5, not 4, see issue #23). For each tab - named after the file it is for - I have included

the column title as found in its source file (being in https://github.com/arms-mbon/data_workspace/tree/main/qualitycontrolled_data/from_plutof or https://github.com/arms-mbon/data_workspace/tree/main/qualitycontrolled_data/from_gs)
the column title as it should be in the xxxData.csv file -> you hard-coded these before, it is probably better to take them from here (I changed a very few of these titles)
the definition, data type, property, observatory property URL, unit, unit URL

--> Copy to the xxxData_Metadata.csv file the contents of columns B to H
--> note that for the combined_SamplingEventData.csv file, I have one extra column in my uploaded file - instead of just the column title as in the source plutoF OR GS file, I have added both, because for this one file you get most of the input data from GS, but some instead come from PlutoF. So for this one, the metadata.csv file is created from columns C-I

Error in the combined spreadsheets

modify PlutoF QC

Syntax errors in the plutoF download

In the CSVs you create from the plutof download, the column title "Associated date" should be "Associated data"

In all CSVs, where a date is reported, please give just the date and not also the time
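
A small sketch of the date-only request, assuming the timestamps come as ISO 8601 strings such as "2021-07-27T10:15:00" (an assumption about the download format). Values that cannot be parsed are returned unchanged.

```python
from datetime import datetime


def date_only(value):
    """Return the YYYY-MM-DD part of an ISO 8601 timestamp; leave other values as is."""
    try:
        return datetime.fromisoformat(value).date().isoformat()
    except ValueError:
        return value


print(date_only("2021-07-27T10:15:00"))
print(date_only("not provided"))
```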

Adding the ARMS googlesheet data to here

The ARMS overview googlesheet data need to be downloaded to the ARMS-MBON data repo.
There will then need to be a QC and a combining of that with the PlutoF data, but those two actions will be raised in other issues.

These googlesheets are here

Download the GS data into ARMS github

Grab tab 1 and copy it to https://github.com/arms-mbon/Data/tree/main/QualityControlledData/FromGS to a CSV called “ARMS_ObservatoryCoordinates.csv”. Note that there are multi-element strings in some rows (H) which use ; as a separator, so best to use comma for the CSV separator here.
Grab tab 2 and copy it to https://github.com/arms-mbon/Data/tree/main/QualityControlledData/FromGS to a CSV called “ARMS_SamplesAndOmics.csv”. Ignore the first column and the last 20 (from “initial number of paired-end reads” to the end).

more metadata for the arms images

Now that the image filenames and URLs are accessible from the PlutoF download as harvested into GitHub (associated_data_[observatory]), the information therein needs to be supplemented
In particular we would like

the plate number and side (and if there are >1image for a plate and side, that is OK, maybe add an iterator tho)
if this is a field image instead

To check out how possible this is to do on the current crop of images

are the filenames presented well enough to get this info from there?
are enough events provided with an image-explanation spreadsheet, well enough formatted, that this can be used
perhaps may need to do some manual work, e.g. indicating where to use the filename, where to use a CSV file, and perhaps even myself editing those CSV files where they are not quite right
what other metadata from the images pages on plutof do we want to add to this spreadsheet, so they can be made visible (as the metadata page on plutof for each image is not downloadable from outside, only the actual image is). is there more metadata there than is in the .jpg images themselves anyway?

If yes to the above, then @kmexter should go ahead and do the necessary preparation and then @cedricdcc will write the script to implement these actions

Bugs in combined data to be fixed ASAP

I am looking at the combined GS and Plutof data in the CSV file https://github.com/arms-mbon/Data/blob/main/QualityControlledData/Combined/combined_SamplingEventData.csv.

The ARMS units for RavMarine and RavHarbour are RavH1,2,3 and RavM1,2,3, and this is correct in the googlesheet download in github and the plutof download in github. But in this file they have been translated into RAVH and RAVM in column "UnitID", with the result that we appear to have lost 4 units. This is not correct

The file called GS_ARMS_Material_Samples_and_Sequence_Info.csv should be called GS_ARMS_Material_Samples_and_Sequence_Metadata.csv because it comes from the metadata tab for samples&sequences.
The data file is called GS_ARMS_Material_Samples_Sequence.csv but the metadata file is called GS_ARMS_Material_Samples_and_Sequence_Info.csv --> either with the "and" or without it for both of them please (better without)

BojaVida and LukaKP are the ARMS units belonging to GulfOfPiran, but in the combined sheet they are listed in the Observatory column and GulfOfPiran has gotten lost. The googlesheet and its download are correct, so something has gone wrong here.

data.arms-mbon.org to be created

create the arms-mbon "data front page"

After QC, combine the googlesheet and PlutoF data

After the QC has been done (issue nr 13) and I have made any corrections and then the QC re-run and passed (this will be a manual step), then the information in the GS needs to be added to the info in PlutoF. So we make one combined “dataset” here? Here I mean all the data from the GS, not only the data that has been subjected to a QC.

This combined dataset will go in https://github.com/arms-mbon/Data/tree/main/QualityControlledData/Combined.
I don’t have a preference as to how you organise this dataset, but what I will be wanting to be able to show, from that dataset, is the following (as CSV files for now)

- ObservatoryInfo.csv: Observatory, Unit, Latitude, Longitude, Depth (m), Field replicates, Monitoring Area, Habitat keywords. These latter 3 are from the GS, the rest are from either (as they should be corrected to be the same by now)
- SamplingEventInfo.csv: EventID, MaterialSampleID, Observatory, Unit, Date Deployed, Date Collected, and then columns that come from GS but with the titles: “Fraction”, “Preservative”, “Filter (micrometre)”, “Crate cover”, number of image data, number of ena data. The master list of material sample IDs to follow (i.e. to know how many rows need to be added) should be that combination of GS and PlutoF here also (there could be more in the GS than in PlutoF, and that is OK).
- OmicsInfo.csv: MaterialSampleID, and then columns from the GS: “MaterialSampleID original”, “gene COI”, “demultiplexed COI”, “negativeControl gene COI”, “gene ITS”, “demultiplexed ITS”, “negativeControl gene ITS”, “gene 18S”, “demultiplexed 18S”, “negativeControl gene 18S”
- Image data

new columns in the arms googlesheet

In this tab of the arms googlesheet
https://docs.google.com/spreadsheets/d/1j3yuY5lmoPMo91w6e3kkJ6pmp1X6FVGUtLealuKJ3wE/edit#gid=1607535453
Note that there are 4 new columns at the end and these need to be copied (as is) into the observatory info CSV files and from there into the combined observatory CSV also

new destination for plutof download

Naming for data download scripts and I/O

The scripts that are written to harvest data from plutof or google, can they have names that are consistent and say better (to others) what they do e.g.

PlutoF_harvestAndQC
GoogleSheet_harvestAndQC (when that one is written)

Also, the output with the following names: could they be changed in the following way
associated.csv -> AllAssociatedData
main.csv -> how is this different from "overview"? if it is the same, then delete it please
material_samples.csv -> AllMaterialSamples
observations.csv -> AllObservations
overview.csv -> AllOverview
sequences.csv -> AllSequences

ARMS_data.json -> AllARMSPlutof.json

add these information to a new PlutoF download output file

To create a new output from the plutof download-QC script: this should be called observatory_info.csv and contain, for each station

Station, Country, ARMS_unit (these exactly the same as in overview_data.csv)
then Latitude, Longitude, and Depth (Depth taken from the metadatum, in the page for each arms unit/subsampling area, called Depth max; if that is blank, use Depth min; if that is also blank, use "not provided")

Just that
And one grand overview one for all stations
thanks!
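
The depth fallback described above could look roughly like this. The function and argument names are made up for the example; the order (Depth max, then Depth min, then "not provided") follows the request.

```python
def pick_depth(depth_max, depth_min):
    """Use Depth max if filled in, else Depth min, else 'not provided'."""
    for value in (depth_max, depth_min):
        if value is not None and str(value).strip():
            return str(value).strip()
    return "not provided"


print(pick_depth("20", "12"))
print(pick_depth("", "12"))
print(pick_depth(None, ""))
```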

What to do with ARMS images?

To make the ARMS images accessible, the following are pre-requisites

the overview of images (e.g. https://github.com/arms-mbon/Data/blob/main/QualityControlledData/FromPlutoF/Koster/associated_data_Koster.csv) should have more metadata in it (see issue#10)
the table needs to provide a clickable link for people to download individual images
people need to be able to select subsets of images and download them all
the downloaded images should have the correct filename (see the CSV file), not the silly long string that PlutoF provides

During the next ARMS meeting, I should ask what the use-cases will be: what sort of subsetting of images will they want to do when downloading images, will they even want to do this or is one at a time enough, what metadata will they subselect on, will they mind waiting for a request-download-subset (e.g. they say what images they want, launch a request, and we send an email when it is ready to access).

Then need to decide what sort of DB (mongoDB?) we can ask for that will do this: will store images, store image metadata, provide URLs for individual download for externals, and provide queries and zip-download for externals

Will gitLFS be needed here? the ARMS images themselves will never change, but do we want to be able to track e.g. use of images (via gitcomments or similar), or related images ...

How to deconstruct the material sample id from plutof to get information

In https://github.com/arms-mbon/data_workspace/blob/main/QualityControlledData/Combined/combined_SamplingEventData.csv, the rows gotten only from PlutoF have to use the materialSampleID to get the data for the columns: fraction, preservative, filter. This is how you should do it

a materialsampleID as gotten from PlutoF is constructed with the following elements: string _ string _ string _ date _ date _ possiblestring _ possiblestring _ possible string
the information you need is in these last 3 possible strings. So count from the FIRST string, and go forward 5 (i.e. until the final date)
then extract the final elements, whether there are 0, 1, 2 or 3 of them, and pass each of them through a string match. I recommend always dropping to lower (or upper) case when doing this, since there could be an incorrect mix of upper and lower cases. But do return to the original case when copying data into the combined file.

IF the string is MF### then fraction=Motile, filter = ###
IF the string is SF### then the fraction=Sessile, filter = ###
IF the string is just MF or just SF (and no number), then the fraction= Motile or Sessile accordingly
IF the string is SED then the fraction=Sediment; IF the string is SED### then fraction=Sediment and filter = ###
IF the string is PS then the fraction=Plankton; IF the string is PS### then fraction=Plankton and filter = ###
IF no MF, SED, PF, SF (with or without numbers) is encountered, fraction=not provided
IF the string has the phrase DMSO in it (note phrase, because this can be dmso or dmso-), then this string goes into preservative
IF the string has the phrase ETOH in it, then this string goes into preservative
IF no DMSO or ETOH has been encountered, preservative=DMSO (yes, this is the default value)
IF the string is one letter (A,B,C, etc), then this needs to go into a new column (see point below)

The new column to add is to be called Sample replicate and you copy the A, B etc. into there. Everywhere else you can leave it blank. The column should go after filter.
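
The rules above can be sketched as one parsing function. Some points are assumptions rather than part of the request: the DMSO default is read as applying to the preservative column (DMSO being a preservative), the filter falls back to "not provided" when no number is present, and the sample IDs in the example are made up.

```python
import re

FRACTIONS = {"mf": "Motile", "sf": "Sessile", "sed": "Sediment", "ps": "Plankton"}
FRACTION_RE = re.compile(r"^(mf|sf|sed|ps)(\d*)$")


def parse_material_sample_id(sample_id):
    """Derive fraction, filter, preservative and replicate from a PlutoF sample ID."""
    info = {
        "Fraction": "not provided",
        "Filter": "not provided",  # assumed placeholder when no number is given
        "Preservative": "DMSO",    # default when neither DMSO nor ETOH is found
        "Sample replicate": "",
    }
    # Skip the first 5 elements (3 strings + 2 dates); 0 to 3 elements remain.
    for token in sample_id.split("_")[5:]:
        lowered = token.lower()  # match case-insensitively
        match = FRACTION_RE.match(lowered)
        if match:
            info["Fraction"] = FRACTIONS[match.group(1)]
            if match.group(2):
                info["Filter"] = match.group(2)
        elif "dmso" in lowered or "etoh" in lowered:
            info["Preservative"] = token  # keep the original case
        elif len(token) == 1 and token.isalpha():
            info["Sample replicate"] = token
    return info


# Hypothetical sample IDs following the described pattern
print(parse_material_sample_id("ARMS_Koster_VH1_20190527_20200716_SF40_ETOH"))
print(parse_material_sample_id("ARMS_Koster_VH1_20190527_20200716_MF500_A"))
```

Tokens that match none of the rules are ignored here; a real script might want to log them so unexpected ID formats are noticed.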

To be fixed by Q2 2023

In the QC report returned in the comparison of the google sheets to plutof, there are some funnies that need fixing
Looking at https://github.com/arms-mbon/Data/blob/main/QualityControlledData/FromGS/qc_report_arms_observatories_gsheets_to_plutoF.csv

SHNP2,3,4 are reported as failing on the arms_id.

The report says that the value for these in PlutoF is Ven2 which is incorrect
They do actually exist in PlutoF, it is just that they do not have any events. But I want to compare the observatories here, not the events, so better is to compare "sample site" information from PlutoF to GS

RavHarbour depth says it has both passed and failed, in subsequent rows, which clearly cannot be

Bodo GStraM is repeated twice, as are other rows. Why?

Combined spreadsheet blank cell value

In https://github.com/arms-mbon/data_workspace/blob/main/QualityControlledData/Combined/combined_SamplingEventData.csv, the following columns should have a default value of "not provided"
Fraction
Preservative
Filter
CrateCover

The default value for the following should be 0
Number of associated data files
Number of ENA sequences
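
A short sketch of how these defaults might be applied to one row of the combined file, assuming the row is held as a dictionary. Columns that are missing altogether are also filled in, which is an assumption beyond the request.

```python
DEFAULTS = {
    "Fraction": "not provided",
    "Preservative": "not provided",
    "Filter": "not provided",
    "CrateCover": "not provided",
    "Number of associated data files": "0",
    "Number of ENA sequences": "0",
}


def fill_defaults(row):
    """Return a copy of the row with blank or missing cells set to their default."""
    filled = dict(row)
    for column, default in DEFAULTS.items():
        value = filled.get(column)
        if value is None or str(value).strip() == "":
            filled[column] = default
    return filled


print(fill_defaults({"Fraction": "", "Filter": "40"}))
```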

Modifications to the PlutoF and GS harvesting and combining

For harvesting from the GS, the following changes have since been made
1- there is no longer a column called "Field replicates" (column I) in the ARMS Observatory info sheet
2- there is now a new column called "fieldReplicate" (column AC) in the ARMS samples+sequences sheet. The value here is a string

So this new column will be harvested accordingly in the https://github.com/arms-mbon/data_workspace/tree/main/qualitycontrolled_data/from_gs folder

The consequence of the replicates being with the sampling event data is that this column should not be added to the https://github.com/arms-mbon/data_workspace/blob/main/qualitycontrolled_data/combined/combined_ObservatoryData.csv any more, but rather https://github.com/arms-mbon/data_workspace/blob/main/qualitycontrolled_data/combined/combined_SamplingEventData.csv. Copy it over, as is, into a column with the name FieldReplicate which you can put after the MaterialSampleID column. Where a row is filled in with only plutof information (i.e. that event is not in the googlesheet), please put the value "Not provided" there.

Can you then do a new harvest and in particular also check

that there are no repeat BelgiumCoast rows where the country is Belgium (from PlutoF) rather than BEL (combined PF and GS)

automatic README.md

When you download from PlutoF and organise here, can you create an automatic README.md text for each station (observatory) folder?

Here are the metadata downloaded from PlutoF on [date] for the observatory [observatory name]\n
Each type of PlutoF page is represented by a separate spreadsheet\n

Overview: a summary of the arms units (sites), events, and the number of associated data, material samples, sequences, and observations that are present
One CSV each for the: material samples, the associated data, the sequences, and the observations
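
The README text above could be produced by a small template function like the following sketch. The function name and the use of list bullets are illustrative choices; writing the text to README.md in each station folder is left out.

```python
def station_readme(observatory, download_date):
    """Build the README.md text for one observatory folder."""
    lines = [
        f"Here are the metadata downloaded from PlutoF on {download_date} "
        f"for the observatory {observatory}",
        "Each type of PlutoF page is represented by a separate spreadsheet",
        "",
        "- Overview: a summary of the arms units (sites), events, and the number of "
        "associated data, material samples, sequences, and observations that are present",
        "- One CSV each for the: material samples, the associated data, the sequences, "
        "and the observations",
    ]
    return "\n".join(lines) + "\n"


print(station_readme("Koster", "2023-01-15"))
```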

QC steps to be performed on the ARMS google sheet when compared to the ARMS PlutoF

Before these QC scripts are run, it is probably a good idea to do a fresh download of PlutoF (and apply its QC script) and a download from the googlesheets (issue nr 12)

First, the information in Tab 1 of the GS needs to be compared to the information obtained from the QC-corrected PlutoF
QC on Tab 1. I need to know that the list of Observatories and their associated information, in the GS and the list of QC-corrected Observatories, and their associated information, in PlutoF are the same. So the list of observatories needs to be compared between GS and PlutoF (the QC-corrected version of PlutoF), the list of ARMS units, and the lat, long, and depth values for each unit. In order to make this manageable by the script as well as by me (who will have to check the output of the QC and make corrections accordingly), I propose the following.

Start on observatory IDs/names: is the set of observatories in PlutoF (QC-corrected already) and the GS the same and where do they differ?

For example, let us ask: there is a Koster from the GS, is there a Koster in the PlutoF (yes); is the kOSTer from the GS also in PlutoF as kOSter from PlutoF (no); the BigStation (a made-up name) from PlutoF, is that also in the GS (no)
So, for each observatory in GS, is it in PlutoF, and for the remaining observatories in PlutoF, are they in GS? Match should be exact - including case and spaces or other characters.
Where there are no matches, I need to know this. Create a CSV file called Tab1_QCobservatory.csv for this information. In there could be the columns: “observatory (GS)”, “observatory (PlutoF)”, and you write in these columns the matched values, and when there is no match instead you write the single value (either from GS or PlutoF) with the word “no match” for the other value. To be more clear, for the example above the spreadsheet would look like this: row1: Koster, Koster; row2: kOSTer, no match; row3: no match, BigStation

Where there ARE matches on the Observatory level, then the ARMS units need to be checked, to see if for each observatory there is the same set of ARMS units in the GS as there are in the QC-corrected PlutoF download. That information can go to a file called Tab1_QCunit.csv, and you can have the columns: “Observatory” (since GS and PlutoF are the same, the name can be taken from either), “ARMS (GS)”, and “ARMS (PlutoF)”. This would then be filled in the same way as with the Observatory search: so either with the 2 ARMS unit names where there is a match, or a name and a “no match” where there is no match
Finally, where there are matches on Observatories and ARMS units, then I need the Lat, Long, and Depth values to be compared. I would create a third CSV file, Tab1_QCcoords.csv, and have the columns “Observatory”, “ARMS unit”, and then “Latitude”, “Longitude”, “Depth”. For each observatory-armsunit combo, if the Lat, Long, Depth values in PlutoF and in the GS are the same, you can put “match” in the cells, if they do not match, put “no match” in both cells and I will know to check those out myself.
Send me an email when that is done
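
The matching used for Tab1_QCobservatory.csv (and, per observatory, for Tab1_QCunit.csv) can be sketched as below. The function name is hypothetical, and the example lists are chosen so that the output reproduces the three rows given in the Koster example.

```python
def match_names(gs_names, plutof_names):
    """Return (GS value, PlutoF value) rows, with 'no match' for a missing side.

    Matching is exact: case, spaces and other characters all count.
    """
    gs = set(gs_names)
    plutof = set(plutof_names)
    rows = []
    for name in gs_names:
        rows.append((name, name if name in plutof else "no match"))
    for name in plutof_names:
        if name not in gs:
            rows.append(("no match", name))
    return rows


for row in match_names(["Koster", "kOSTer"], ["Koster", "BigStation"]):
    print(row)
```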

DONE

Next, QC on Tab 2
First we need to check again that the observatory and arms units are the same in Tab 2 as they are in Tab 1, and then we need to compare the sampling events in Tab 2 to those in the QC-corrected PlutoF download

Compare the unique Observatory-ID in Tab 2 to those in Tab 1, and for each unique Observatory, compare the mentioned ARMS-IDs to those in Tab 1. Note: it is OK if there are more observatories and arms units in Tab 1 than Tab 2, but not more in Tab 2 than in Tab 1. If there is a mismatch, then report this in Tab2_QCobservatory.csv: “observatory”, “arms unit”: only enter values under these columns if there are NO matches with Tab 1
Now compare the Event_ID in the (QC-corrected) PlutoF download with the EventID in the GS. I am not sure of the best way to do this, but what I need is a list of all the EventIDs in GS and whether those same Event_IDs are in PlutoF (match/no match), and then if there are any more PlutoF Event_IDs that are not in the GS.

So maybe make a unique list by combining the two lists of EventIDs
Then write this list out together with match/no match info: 1st column “event id”, 2nd column “in plutof” (y/n), 3rd column “in gs” (y/n). Call this Tab2_QCevents.csv

Next we need to compare the list of Material Samples in the GS (MaterialSample-ID) and PlutoF (MaterialSampleID, and I know this only exists for some observatories). I think you can do the same as above: make one unique (combined) list and then put that in column 1, with y/n in the “in plutof” and “in GS” columns. Call that Tab2_QCsamples.csv
Send me an email when that is done
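
For Tab2_QCevents.csv and Tab2_QCsamples.csv, the combined unique list with y/n columns could be built like this. It is only a sketch on plain lists of IDs (the IDs in the example are invented); sorting the combined list is a choice made here for stable output.

```python
def presence_table(plutof_ids, gs_ids):
    """Rows of (id, in plutof y/n, in gs y/n) over the union of both ID lists."""
    plutof = set(plutof_ids)
    gs = set(gs_ids)
    return [
        (item, "y" if item in plutof else "n", "y" if item in gs else "n")
        for item in sorted(plutof | gs)
    ]


for row in presence_table(["E1", "E2"], ["E2", "E3"]):
    print(row)
```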