arms-mbon / data_workspace
This repo is the working area for ARMS-MBON event and omics data: data are collected, harvested, combined, reformatted
License: MIT License
Things to improve with respect to the download from PlutoF as it is currently coded (July 27)
If you think we need to discuss a better way to manage this QC of the station and ARMS unit names, it is probably better to do so before Aug 12. In the end it may be better to use my input to correct these names rather than hard-coding them, with alerts for when something new is encountered that I have not yet provided input for.
The spreadsheet with the corrections to apply to the station and arms names:
PlutoF_QC_v2_StationARMSnames.csv
It is necessary to catch the following cases in the script where PlutoF metadata are downloaded and subjected to a QC on the station and ARMS unit names
The input to this QC is the CSV file with the name PlutoF_QC_StationsARMSnames.csv and it will be updated by (usually) Katrina, using git history to keep track of changes that will be made to it.
This file has 6 columns: Station, Station corrected, Country, Country corrected, ARMS unit, ARMS unit corrected. The QC script will change every Station, ARMS unit, and Country value that it encounters in the PlutoF download to the "corrected" value.
If there is a different number of station+ARMS units in PlutoF than in this spreadsheet (because more have been added to PlutoF since the spreadsheet was made), then the QC may not be done fully.
To avoid this the script should do the following:
So, if the QC input is:
Station | Station corrected | Country | Country corrected | ARMS unit | ARMS unit corrected
Koster | | Sweden | | Koster_VH1 | VH1
....
Then the QC output would look something like:
Station | Station corrected | Station QC | Country | Country corrected | Country QC | ARMS unit | ARMS unit corrected | ARMS unit QC
Koster | | passed | Sweden | | passed | Koster_VH1 | VH1 | passed
....
Koster | | | Sweden | | | Koster_VH4 | | new
I see that now the Belgian and one other station have arms units that I do not have in the current QC input file, so if you test it out now there should be some "new"s in the QC report
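The QC described above can be sketched as follows. This is a minimal illustration and not the existing script: the function names `load_corrections` and `qc_record` are hypothetical, the header names follow the example table ("Station corrected", not "Stations corrected"), and an empty "corrected" cell is assumed to mean that the original value is kept. Note that the sketch reports a status per column, so a known station on a row with a new ARMS unit is marked "passed", whereas the example output above leaves those cells blank.

```python
import csv
import io

QC_COLUMNS = ["Station", "Country", "ARMS unit"]


def load_corrections(qc_file):
    """Read the QC input CSV into {column: {original value: corrected value}}."""
    corrections = {col: {} for col in QC_COLUMNS}
    for row in csv.DictReader(qc_file):
        for col in QC_COLUMNS:
            original = (row.get(col) or "").strip()
            corrected = (row.get(f"{col} corrected") or "").strip()
            if original:
                # Assumption: an empty "corrected" cell keeps the original value.
                corrections[col][original] = corrected or original
    return corrections


def qc_record(record, corrections):
    """Return (corrected record, status), where status is 'passed' or 'new' per column."""
    corrected, status = {}, {}
    for col in QC_COLUMNS:
        value = record[col]
        if value in corrections[col]:
            corrected[col] = corrections[col][value]
            status[col] = "passed"
        else:
            # Value not yet in the QC input: leave it untouched and flag it.
            corrected[col] = value
            status[col] = "new"
    return corrected, status


qc_csv = io.StringIO(
    "Station,Station corrected,Country,Country corrected,ARMS unit,ARMS unit corrected\n"
    "Koster,,Sweden,,Koster_VH1,VH1\n"
)
corrections = load_corrections(qc_csv)
known = qc_record({"Station": "Koster", "Country": "Sweden", "ARMS unit": "Koster_VH1"}, corrections)
unknown = qc_record({"Station": "Koster", "Country": "Sweden", "ARMS unit": "Koster_VH4"}, corrections)
print(known)
print(unknown)
```

Rows whose status contains "new" are the ones that would go into the QC report for manual follow-up.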
For the files that you create from the combined PlutoF and GS data, can you make the following changes:
1- for the file called https://github.com/arms-mbon/data_workspace/blob/main/qualitycontrolled_data/combined/combined_ImageData.csv, can you please only put in there the rows that have "filetype" Image.
2- then, for the file https://github.com/arms-mbon/data_workspace/blob/main/qualitycontrolled_data/combined/combined_SamplingEventData.csv, the column currently called "Number of associated data files" should be changed to "Number of images" and should report the ...number of images
3- but then for the remaining files that are currently in https://github.com/arms-mbon/data_workspace/blob/main/qualitycontrolled_data/combined/combined_ImageData.csv, that are NOT of filetype "image", can you move those to a file called https://github.com/arms-mbon/data_workspace/blob/main/qualitycontrolled_data/combined/combined_OtherDataFiles.csv
4- For the file https://github.com/arms-mbon/data_workspace/blob/main/qualitycontrolled_data/combined/combined_SamplingEventData.csv, in the column "Number of ENA sequences", can you change that to "Sequences available" and, rather than reporting the number, report the unique list of gene types: so for an event, if there is an ERR number in Gene_COI, Gene_ITS, and Gene_18S, then the list is "COI; ITS; 18S", and accordingly for a different combination of what is available.
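Points 1, 3, and 4 above could look roughly like this. It is a sketch under assumptions: rows are plain dictionaries as produced by `csv.DictReader`, the filetype comparison is case-insensitive (the issue writes both "Image" and "image"), and the helper names `split_by_filetype` and `sequences_available` are hypothetical. What to write when no gene column is filled is not specified above, so the sketch returns an empty string.

```python
GENE_COLUMNS = ["Gene_COI", "Gene_ITS", "Gene_18S"]


def split_by_filetype(rows, image_type="image"):
    """Split rows into (image rows, other rows) based on the 'filetype' column."""
    images, others = [], []
    for row in rows:
        if (row.get("filetype") or "").strip().lower() == image_type:
            images.append(row)
        else:
            others.append(row)
    return images, others


def sequences_available(event_row):
    """Build the 'Sequences available' value, e.g. 'COI; ITS; 18S'."""
    genes = [
        col[len("Gene_"):]
        for col in GENE_COLUMNS
        if (event_row.get(col) or "").strip()
    ]
    return "; ".join(genes)
```

The image rows would go to combined_ImageData.csv and the remaining rows to combined_OtherDataFiles.csv.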
Some checkpoints on the PlutoF download and QC script are necessary. The script can continue working, but it should produce a report that is sent to me (can this be done automatically?)
The report should include
These are my suggested initial list of checkpoints
For the files in https://github.com/arms-mbon/data_workspace/tree/main/qualitycontrolled_data/combined, each xxxxData.csv file now needs a xxxxData_Metadata.csv file
In the attached I have explained what goes in these metadata files for the 5 files in this folder in GH (note: 5, not 4, see issue #23). For each tab - named after the file it is for - I have included
--> Copy to the xxxData_Metadata.csv file the contents of columns B to H
--> note that for the combined_SamplingEventData.csv file, I have one extra column in my uploaded file - instead of just the column title as in the source plutoF OR GS file, I have added both, because for this one file you get most of the input data from GS, but some instead come from PlutoF. So for this one, the metadata.csv file is created from columns C-I
In the CSVs you create from the plutof download, the column title "Associated date" should be "Associated data"
In all CSVs, where a date is reported, please report just the date and not also the time
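A small sketch of the date-only request. The format of the timestamps in the downloads is not stated here, so this assumes ISO 8601 values such as `2021-07-27T10:15:00`; anything that cannot be parsed is returned unchanged rather than guessed at. The function name `date_only` is hypothetical.

```python
from datetime import datetime


def date_only(value):
    """Return only the date part of an ISO 8601 date or date-time string."""
    text = (value or "").strip()
    if not text:
        return text
    try:
        return datetime.fromisoformat(text).date().isoformat()
    except ValueError:
        # Not an ISO 8601 value: leave it as it is for manual checking.
        return text
```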
The ARMS overview googlesheet data need to be downloaded to the ARMS-MBON data repo.
There will then need to be a QC and a combining of that with the PlutoF data, but those two actions will be raised in other issues.
These googlesheets are here
Download the GS data into ARMS github
Now that the image filenames and URLs are accessible from the PlutoF download as harvested into GitHub (associated_data_[observatory]), the information therein needs to be supplemented
In particular we would like
To check out how possible this is to do on the current crop of images
If yes to the above, then @kmexter should go ahead and do the necessary preparation, and then @cedricdcc will write the script to implement these actions
I am looking at the combined GS and Plutof data in the CSV file https://github.com/arms-mbon/Data/blob/main/QualityControlledData/Combined/combined_SamplingEventData.csv.
The ARMS units for RavMarine and RavHarbour are RavH1,2,3 and RavM1,2,3, and this is correct in the googlesheet download in github and the plutof download in github. But in this file they have been translated into RAVH and RAVM in column "UnitID", with the result that we appear to have lost 4 units. This is not correct
The file called GS_ARMS_Material_Samples_and_Sequence_Info.csv should be called GS_ARMS_Material_Samples_and_Sequence_Metadata.csv because it comes from the metadata tab for samples&sequences.
The data file is called GS_ARMS_Material_Samples_Sequence.csv but the metadata file is called GS_ARMS_Material_Samples_and_Sequence_Info.csv --> either with the "and" or without it for both of them please (better without)
BojaVida and LukaKP are the ARMS units belonging to GulfOfPiran, but in the combined sheet they are listed in the Observatory column and GulfOfPiran has gotten lost. The googlesheet and its download are correct, so something has gone wrong here.
create the arms-mbon "data front page"
After the QC has been done (issue nr 13) and I have made any corrections and then the QC re-run and passed (this will be a manual step), then the information in the GS needs to be added to the info in PlutoF. So we make one combined “dataset” here? Here I mean all the data from the GS, not only the data that has been subjected to a QC.
This combined dataset will go in https://github.com/arms-mbon/Data/tree/main/QualityControlledData/Combined.
I don’t have a preference as to how you organise this dataset, but what I will be wanting to be able to show, from that dataset, is the following (as CSV files for now)
In this tab of the arms googlesheet
https://docs.google.com/spreadsheets/d/1j3yuY5lmoPMo91w6e3kkJ6pmp1X6FVGUtLealuKJ3wE/edit#gid=1607535453
Note that there are 4 new columns at the end and these need to be copied (as is) into the observatory info CSV files and from there into the combined observatory CSV also
The scripts that are written to harvest data from plutof or google, can they have names that are consistent and say better (to others) what they do e.g.
Also, could the outputs with the following names be changed as follows:
associated.csv -> AllAssociatedData
main.csv -> how is this different from "overview"? if it is the same, then delete it please
material_samples.csv -> AllMaterialSamples
observations.csv -> AllObservations
overview.csv -> AllOverview
sequences.csv -> AllSequences
ARMS_data.json -> AllARMSPlutof.json
To create a new output from the plutof download-QC script: this should be called observatory_info.csv and contain, for each station
Just that
And one grand overview one for all stations
thanks!
To make the ARMS images accessible, the following are pre-requisites
During the next ARMS meeting, I should ask what the use-cases will be: what sort of subsetting of images will they want to do when downloading images, will they even want to do this or is one at a time enough, what metadata will they subselect on, and will they mind waiting for a request-download-subset (e.g. they say what images they want, launch a request, and we send an email when it is ready to access).
Then need to decide what sort of DB (mongoDB?) we can ask for that will do this: will store images, store image metadata, provide URLs for individual download for externals, and provide queries and zip-download for externals
Will gitLFS be needed here? the ARMS images themselves will never change, but do we want to be able to track e.g. use of images (via gitcomments or similar), or related images ...
In https://github.com/arms-mbon/data_workspace/blob/main/QualityControlledData/Combined/combined_SamplingEventData.csv, the rows gotten only from PlutoF have to use the materialSampleID to get the data for the columns: fraction, preservative, filter. This is how you should do it
The new column to add is to be called Sample replicate, and you copy the A, B, etc. into there. Everywhere else you can leave it blank. The column should go after Filter.
In the QC report returned in the comparison of the google sheets to plutof, there are some funnies that need fixing
Looking at https://github.com/arms-mbon/Data/blob/main/QualityControlledData/FromGS/qc_report_arms_observatories_gsheets_to_plutoF.csv
SHNP2,3,4 are reported as failing on the arms_id.
RavHarbour depth says it has both passed and failed, in subsequent rows, which clearly cannot be
Bodo GStraM is repeated twice, as are other rows. Why?
In https://github.com/arms-mbon/data_workspace/blob/main/QualityControlledData/Combined/combined_SamplingEventData.csv, the following columns should have a default value of "not provided"
Fraction
Preservative
Filter
CrateCover
The default value for the following should be 0
Number of associated data files
Number of ENA sequences
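The defaults listed above could be applied per row along these lines. This is a sketch, assuming rows are dictionaries keyed by the column titles given above and that a missing or blank cell is what should receive the default; the function name `apply_defaults` is hypothetical.

```python
TEXT_DEFAULT_COLUMNS = ["Fraction", "Preservative", "Filter", "CrateCover"]
COUNT_DEFAULT_COLUMNS = ["Number of associated data files", "Number of ENA sequences"]


def apply_defaults(row):
    """Return a copy of the row with blank cells replaced by the agreed defaults."""
    filled = dict(row)
    for col in TEXT_DEFAULT_COLUMNS:
        if not str(filled.get(col) or "").strip():
            filled[col] = "not provided"
    for col in COUNT_DEFAULT_COLUMNS:
        if not str(filled.get(col) or "").strip():
            # Written as text because CSV cells are strings.
            filled[col] = "0"
    return filled
```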
For harvesting from the GS, the following changes have since been made
1- there is no longer a column called "Field replicates" (column I) in the ARMS Observatory info sheet
2- there is now a new column called "fieldReplicate" (column AC) in the ARMS samples+sequences sheet. The value here is a string
So this new column will be harvested accordingly in the https://github.com/arms-mbon/data_workspace/tree/main/qualitycontrolled_data/from_gs folder
The consequence of the replicates being with the sampling event data is that this column should not be added to the https://github.com/arms-mbon/data_workspace/blob/main/qualitycontrolled_data/combined/combined_ObservatoryData.csv any more, but rather https://github.com/arms-mbon/data_workspace/blob/main/qualitycontrolled_data/combined/combined_SamplingEventData.csv. Copy it over, as is, into a column with the name FieldReplicate which you can put after the MaterialSampleID column. Where a row is filled in with only plutof information (i.e. that event is not in the googlesheet), please put the value "Not provided" there.
Can you then do a new harvest and in particular also check
When you download from PlutoF and organise here, can you create an automatic README.md text for each station (observatory) folder?
Here are the metadata downloaded from PlutoF on [date] for the observatory [observatory name]\n
Each type of PlutoF page is represented by a separate spreadsheet\n
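Generating that README text for each observatory folder might look like the sketch below. The wording is taken from the two template lines above; the function names `build_readme` and `write_readme`, the ISO date format, and the blank line between the two sentences are assumptions.

```python
from datetime import date
from pathlib import Path


def build_readme(observatory, download_date):
    """Fill in the README template for one observatory."""
    return (
        f"Here are the metadata downloaded from PlutoF on {download_date.isoformat()} "
        f"for the observatory {observatory}\n\n"
        "Each type of PlutoF page is represented by a separate spreadsheet\n"
    )


def write_readme(folder, observatory, download_date):
    """Write README.md into the observatory folder and return its path."""
    path = Path(folder) / "README.md"
    path.write_text(build_readme(observatory, download_date), encoding="utf-8")
    return path
```

Calling `write_readme(folder, name, date.today())` once per observatory folder at the end of the download step would keep the README in sync with each harvest.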
Before these QC scripts are run, it is probably a good idea to do a fresh download of PlutoF (and apply its QC script) and a download from the googlesheets (issue nr 12)
First, the information in Tab 1 of the GS needs to be compared to the information obtained from the QC-corrected PlutoF
QC on Tab 1. I need to know that the list of Observatories and their associated information, in the GS and the list of QC-corrected Observatories, and their associated information, in PlutoF are the same. So the list of observatories needs to be compared between GS and PlutoF (the QC-corrected version of PlutoF), the list of ARMS units, and the lat, long, and depth values for each unit. In order to make this manageable by the script as well as by me (who will have to check the output of the QC and make corrections accordingly), I propose the following.
Where there ARE matches on the Observatory level, then the ARMS units need to be checked, to see if for each observatory there is the same set of ARMS units in the GS as there are in the QC-corrected PlutoF download. That information can go to a file called Tab1_QCunit.csv, and you can have the columns: “Observatory” (since GS and PlutoF are the same, the name can be taken from either), “ARMS (GS)”, and “ARMS (PlutoF)”. This would then be filled in the same way as with the Observatory search: so either with the 2 ARMS unit names where there is a match, or a name and a “no match” where there is no match
Finally, where there are matches on Observatories and ARMS units, then I need the Lat, Long, and Depth values to be compared. I would create a third CSV file, Tab1_QCcoords.csv, and have the columns “Observatory”, “ARMS unit”, and then “Latitude”, “Longitude”, “Depth”. For each observatory-armsunit combo, if the Lat, Long, Depth values in PlutoF and in the GS are the same, you can put “match” in the cells, if they do not match, put “no match” in both cells and I will know to check those out myself.
Send me an email when that is done
DONE
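The Tab 1 comparisons described above (names first, then coordinates where the names match) can be sketched as below. The function names and the input shapes are hypothetical: names are given as sets of strings, and coordinates as dictionaries keyed by (observatory, ARMS unit) with numeric (latitude, longitude, depth) values. The sketch writes one "match" or "no match" per value, which is one reading of the description of Tab1_QCcoords.csv.

```python
def compare_names(gs_names, plutof_names):
    """Rows of (GS name, PlutoF name); 'no match' marks the side that lacks the name."""
    rows = []
    for name in sorted(set(gs_names) | set(plutof_names)):
        rows.append((
            name if name in gs_names else "no match",
            name if name in plutof_names else "no match",
        ))
    return rows


def compare_coords(gs_coords, plutof_coords, tolerance=0.0):
    """Compare lat, long, depth for every (observatory, unit) present in both sources."""
    rows = []
    for key in sorted(set(gs_coords) & set(plutof_coords)):
        verdicts = [
            "match" if abs(g - p) <= tolerance else "no match"
            for g, p in zip(gs_coords[key], plutof_coords[key])
        ]
        rows.append((*key, *verdicts))
    return rows
```

`compare_names` can be used for the observatory list and again, per matching observatory, for the ARMS units. The `tolerance` parameter is an addition for illustration, in case small rounding differences between the two sources should not count as mismatches.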
Next, QC on Tab 2
First we need to check again that the observatory and arms units are the same in Tab 2 as they are in Tab 1, and then we need to compare the sampling events in Tab 2 to those in the QC-corrected PlutoF download
Compare the unique Observatory-ID in Tab 2 to those in Tab 1, and for each unique Observatory, compare the mentioned ARMS-IDs to those in Tab 1. Note: it is OK if there are more observatories and arms units in Tab 1 than Tab 2, but not more in Tab 2 than in Tab 1. If there is a mismatch, then report this in Tab2_QCobservatory.csv: “observatory”, “arms unit”: only enter values under these columns if there are NO matches with tab 1
Now compare the Event_ID in the (QC-corrected) PlutoF download with the EventID in the GS. I am not sure of the best way to do this, but what I need is a list of all the EventIDs in GS and whether those same Event_IDs are in PlutoF (match/no match), and then if there are any more PlutoF Event_IDs that are not in the GS.
Next we need to compare the list of Material Samples in the GS (MaterialSample-ID) and PlutoF (MaterialSampleID and I know this only exists for some observatories). I think you can do the same as above: make one unique(combined list) and then put that in column 1, with y/n in the “in plutof” and “in GS” columns. Call that Tab2_QCsamples.csv
Send me an email when that is done
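The combined-list approach suggested above for material samples (and usable for the event IDs as well) could be sketched like this. The names `presence_table` and `to_csv` and the column titles are hypothetical; the IDs are assumed to have been read already from the GS and from the QC-corrected PlutoF download.

```python
import csv
import io


def presence_table(gs_ids, plutof_ids, id_column="ID"):
    """One row per unique ID across both sources, with y/n for each source."""
    gs_set, plutof_set = set(gs_ids), set(plutof_ids)
    return [
        {
            id_column: item,
            "in GS": "y" if item in gs_set else "n",
            "in PlutoF": "y" if item in plutof_set else "n",
        }
        for item in sorted(gs_set | plutof_set)
    ]


def to_csv(rows):
    """Serialise the rows to CSV text, e.g. for Tab2_QCsamples.csv."""
    if not rows:
        return ""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(rows[0].keys()))
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()
```

Rows with "n" in the "in GS" column are the extra PlutoF entries that are not in the GS.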