
cidgoh / dataharmonizer


A standardized browser-based spreadsheet editor and validator that can be run offline and locally, and which includes templates for SARS-CoV-2 and Monkeypox sampling data. This project, created by the Centre for Infectious Disease Genomics and One Health (CIDGOH), at Simon Fraser University, is now an open-source collaboration with contributions from the National Microbiome Data Collaborative (NMDC), the LinkML development team, and others.

License: MIT License

JavaScript 84.67% CSS 0.93% HTML 7.52% Python 6.51% Makefile 0.37%
data harmonization spreadsheet javascript application linkml linkml-schema

dataharmonizer's People

Contributors

cmrn-rhi, cmungall, cpauvert, ddooley, dependabot[bot], griffie, ivansg44, kennethbruskiewicz, mgopez, pkalita-lbl, subdavis, sujaypatil96, takadonet, turbomam

dataharmonizer's Issues

Imported csv date validation problems.

Dates imported in the appropriate ISO 8601 standards are being flagged as invalid and sometimes reformatted into an invalid format and/or a different date upon upload.

EXAMPLES:

  • "2020" (YYYY valid format) is being flagged invalid.
  • "2020-04" (YYYY-MM valid format) converted to "3/31/20" upon import.
  • "2020-01" (YYYY-MM valid format) converted to "12/31/19" upon import.
  • "4/4/2020" (invalid format) converted to "4/4/20".
  • "2019-12-20" (YYYY-MM-DD valid format) converted to "12/19/19" upon import.
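
A minimal sketch of how the three ISO 8601 forms listed above could be checked as plain strings, without handing them to the JavaScript `Date` parser. The name `isValidIsoDate` is hypothetical and not part of DataHarmonizer. The conversions reported above (for example "2020-04" becoming "3/31/20") would be consistent with a date being parsed as UTC midnight and then displayed in a local timezone behind UTC, but that is a guess and has not been confirmed in this issue.

```javascript
// Hypothetical helper: accepts only YYYY, YYYY-MM, or YYYY-MM-DD strings.
function isValidIsoDate(value) {
  const match = /^(\d{4})(?:-(\d{2})(?:-(\d{2}))?)?$/.exec(value);
  if (!match) return false;
  const year = Number(match[1]);
  // Missing month or day parts default to 1 for the range check only.
  const month = match[2] === undefined ? 1 : Number(match[2]);
  const day = match[3] === undefined ? 1 : Number(match[3]);
  if (month < 1 || month > 12) return false;
  // Day 0 of the following month is the last day of this month.
  // Date.UTC is used so the local timezone cannot shift the result.
  const daysInMonth = new Date(Date.UTC(year, month, 0)).getUTCDate();
  return day >= 1 && day <= daysInMonth;
}
```

Because the value stays a string, it can be written back to the cell unchanged after validation.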

CNPHI Export - Travel History

Incorrect mappings for travel history.

DataHarmonizer: destination of most recent travel (city)
is mapping to...
CNPHI: Country of Travel

DataHarmonizer: destination of most recent travel (country)
is mapping to...
CNPHI: City of Travel

BEFORE (DataHarmonizer format):
image

AFTER (CNPHI export):
image

CNPHI Export... Patient Travelled

If there is data in the DataHarmonizer template under travel history, shouldn't the CNPHI export say yes under Patient Travelled? This should apply even when there is no data in the Country of Travel|Province of Travel|City of Travel|Travel start date|Travel End Date field; a user may have skipped those optional fields and chosen to fill in only travel history.

image

`Export To... CNPHI` error

DataHarmonizer Release 0.13.1

Tried exporting the validTestData.csv to CNPHI .xls format and ended up with a column variable that looks like it's supposed to be two columns:

Related Specimen ID|Related Specimen Relationship Type

...and cell contents that don't appear valid (unless this is what CNPHI requests to have as a null value): |

image

Body product term listed under anatomical material

Body product "Fluid (seminal)" is being labelled invalid. Appears to still be listed under "anatomical material" rather than "body product" in data.js. Is correctly listed under body product in data.tsv.

DataHarmonizer Provenance in CNPHI export

CNPHI has agreed that we can output the DataHarmonizer provenance version information into a (free-text) column called additional comments.


Example DH-CNPHI Export CSV:

image


Example Successful CNPHI Upload:

image

Valid imported numbers flagged invalid

All imported passage numbers (from a csv file) were displayed as invalid. If I entered the cell, did not change any values in it and then validated, the numbers were considered valid.

The same thing happened for Number Base Pairs, Consensus Genome Length, Mean Contig Length, and N50.

Ns per 100 kbp immediately recognized all values (int and float) as valid, and recognized non-numerals as invalid (as it should).

Negative numbers not being invalidated

Fields tagged as "decimal" and "integer" allow negative values. This does not make sense for any of the current fields.
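
A minimal sketch of the stricter check, assuming values arrive as strings or numbers from the grid. The function names are hypothetical, and whether zero should be accepted is an open question; this version accepts it.

```javascript
// Hypothetical validators: reject negative values for numeric fields.
// Assumption: zero is allowed (non-negative rather than strictly positive).
function isNonNegativeInteger(value) {
  return /^\d+$/.test(String(value).trim());
}

function isNonNegativeDecimal(value) {
  // Digits, optionally followed by a decimal point and more digits.
  return /^\d+(\.\d+)?$/.test(String(value).trim());
}
```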

In addition, the reference guide for several of these fields tells users to insert a generic "numerical value". I don't know if it's common sense, but stating "positive numerical value" would be more accurate. "Positive integer value" would be even more accurate for integer fields.

"decimal" and "integer" may also not be the best datatype values if we only allow positive values. Wikipedia has a list of descriptors for sets of numbers: https://en.wikipedia.org/wiki/List_of_types_of_numbers. Possible improvements on "integer" are "whole numbers", "natural numbers", and "non-negative integers". Possible improvements on "decimal" are "positive numbers" or "positive decimals". However, I don't believe the users ever see the datatype values, so this is unimportant from a UI point of view.

DataHarmonizer Version Information in Output Files

I know there is currently discussion on how to add version metadata to DataHarmonizer output files; but while a more sophisticated implementation is being determined - perhaps for the time being commented metadata values should be added directly onto the spreadsheet (e.g. in one of the empty cells in the 'database identifiers' row) to ensure there is some version information on files for current users.

import feedback when user upload fails

It would be useful to get some sort of visual feedback/prompt when an import fails, perhaps with troubleshooting suggestions and/or a link to the SOP. Maybe some sort of check that lets the user know they are missing the “database identifiers” row or a column header.

Additionally, is it necessary for the first row to be labelled “database identifiers” specifically? Can’t its contents be ignored?

Non-Frozen Cols not matching Frozen Cols size

Release 0.6.0: The non-frozen column rows do not resize to match the contents/height of the frozen column ‘specimen collect sample ID’ rows. I know this might sound confusing so you can reference figures here for further clarification.

Host Age Bin Paste Issue

I've noticed that if I type in 12 months into host age and host age unit, then host age bin will produce the correct bin (0-9), but if I paste it in I get the wrong bin (10-19) and it considers it valid (See picture).

image
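
A minimal sketch of deriving the bin from both the age and its unit, so that typed and pasted values go through the same calculation. The function name, the unit strings ("month", "year"), the 10-year bin width, and the open-ended top bin are all assumptions for illustration; the actual template may differ.

```javascript
// Hypothetical helper: compute a host age bin from an age and its unit.
// Assumption: 10-year bins ("0-9", "10-19", ...) with an open-ended top bin.
function hostAgeBin(age, unit, topBinStart = 90) {
  if (age === null || age === undefined || String(age).trim() === "") return null;
  const value = Number(age);
  if (!Number.isFinite(value) || value < 0) return null;
  // Convert months to years before binning; any other unit is treated as years.
  const years = unit === "month" ? value / 12 : value;
  if (years >= topBinStart) return `${topBinStart}+`;
  const lower = Math.floor(years / 10) * 10;
  return `${lower}-${lower + 9}`;
}
```

With this sketch, `hostAgeBin(12, "month")` and `hostAgeBin("12", "month")` both fall in the same bin, whereas ignoring the unit would put 12 in "10-19".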

Improve mapping to GISAID field "Passage details/history"

Prepend specimen processing if "virus passage" was selected.

Replace all "unknown" values with blanks. So no more "lab host;unknown;passage method".

Suggested algorithm:

If virus passage is selected from specimen processing:
  Add virus passage

If lab host is selected and not null value:
  Add lab host
If passage number is selected and not a null value:
  Add passage number
If passage method is selected and not a null value:
  Add passage method
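
The suggested algorithm above could be sketched roughly as follows. The row is assumed to be a plain object keyed by field label, the list of null values is a guess, and the ";" separator is taken from the "lab host;unknown;passage method" example; none of this reflects the actual export code.

```javascript
// Assumption: these strings count as null values; the template's real list may differ.
const NULL_VALUES = new Set(["", "unknown", "not applicable", "not collected", "not provided", "missing"]);

function isFilled(value) {
  if (value === undefined || value === null) return false;
  return !NULL_VALUES.has(String(value).trim().toLowerCase());
}

// Hypothetical helper: build the GISAID "Passage details/history" value for one row.
function passageDetails(row) {
  const parts = [];
  // Specimen processing may hold several ";"-separated selections.
  const processing = String(row["specimen processing"] ?? "")
    .split(";")
    .map((item) => item.trim().toLowerCase());
  if (processing.includes("virus passage")) parts.push("Virus passage");
  // Append each remaining field only when it holds a real value.
  for (const field of ["lab host", "passage number", "passage method"]) {
    if (isFilled(row[field])) parts.push(String(row[field]).trim());
  }
  return parts.join(";");
}
```

A row with an "unknown" passage method then contributes nothing for that field instead of the literal word "unknown".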

COVID-19 Vocab Update

COVID-19 vocab update due to changes in DataHarmonizer Templates "CanCOGEn-COVID" tab. To be completed before next release.

Field: geo_loc_name (province/territory)
Value: Yukon
Comments: Changed from "Yukon Territory" to "Yukon" to be compatible with CNPHI upload.

Field: signs and symptoms
Value: Abnormal lung auscultation
Comments: Corrected spelling mistake "ausculation" -> "auscultation".

Date validation showing YYYY-MM imports as invalid

All dates imported with the YYYY-MM format were designated invalid when they should be valid. Occurred for all three file types (csv, tsv, xlsx).

Dates Tested: 2019-12, 2020-05, 2020-03, 2020-01, 2020-04.

Import fails if first row empty

If the user leaves the first row empty, lists headers in the second row, and declares that the headers start on the second row - the import will fail. This happens for all three file types (csv, tsv, xlsx).

Improve performance with large datasets

With the potential of several thousand samples needing validation from stakeholders in Quebec and Ontario, we should improve performance for larger datasets. Telling users to break their data up into chunks should be a last resort.

Scrolling through large datasets is fine, due to only a subset of the data being rendered.

Importing, saving and exporting large datasets freezes the page for a significant amount of time. The bottleneck seems to be the ability of Sheet-JS to read/write large files. Sheet-JS has an XLSX.writeFileAsync function that's worth investigating. For reading, perhaps we can break the binary string FileReader obtains into chunks and have Sheet-JS read them in parallel using promises.

Validating large datasets freezes the page for an incredibly significant amount of time. The bottleneck seems to be with Handsontable's updateSettings function. Perhaps we can take a different approach, that involves iterating through the matrix in parallel and making calls to hot.setCellMeta.

Webworkers could also be useful. We could offload the tasks to the background and provide a loading dialog. Webworkers don't work in Chrome offline, but there may be a workaround https://stackoverflow.com/a/33432215/11472358
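
As a simpler alternative to the parallel and Web Worker ideas above, here is a sketch of cooperative chunking: rows are validated in batches, and control is handed back to the event loop between batches so the page can repaint (for example, to update a loading dialog). This is a swapped-in technique, not something already in the codebase; `validateRow` and the chunk size are placeholders.

```javascript
// Split an array of rows into consecutive chunks of at most `size` rows.
function* chunks(rows, size) {
  for (let start = 0; start < rows.length; start += size) {
    yield rows.slice(start, start + size);
  }
}

// Validate rows chunk by chunk, yielding to the event loop between chunks.
// `validateRow` is a hypothetical per-row validator supplied by the caller.
async function validateInChunks(rows, validateRow, size = 500) {
  const results = [];
  for (const chunk of chunks(rows, size)) {
    for (const row of chunk) results.push(validateRow(row));
    // A zero-delay timeout lets pending rendering and input events run.
    await new Promise((resolve) => setTimeout(resolve, 0));
  }
  return results;
}
```

This does not make validation faster overall; it only keeps the page responsive while the work is done.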

Post Selection Pick-List Shrinkage

Once something is selected from a pick-list, if you try to change it (say you clicked the wrong one by accident) the drop-down menu only shows your current selection, and any existing subclasses, as options. Re-selecting doesn't remove the current selection either.

If this cannot be changed it should be noted somewhere in the SOP (as much as it may seem obvious to some that the user should just delete it - causing the pick-list to reform - this may not occur to all users).

Improve Spreadsheet Navigation

It would be nice to have a way to search for and go to a column based on its header.
With the headers not being alphabetical, and ctrl+f only working for the text currently displayed in the viewport, it can be frustrating trying to navigate to a specific column. E.g. I know I want to edit 'host age' but I can't recall where it is in the spreadsheet, so I have to slowly scroll through the sheet to find it. This gets tedious quickly and will only get more frustrating over time as we add more fields.

Is there a way we could have a header search of some sort? When the user types input, it pulls from a header pick-list and then takes you to your selection. Perhaps using something like scrollViewportTo() - with the header names mapped to a header col number that it jumps you to?
scrollViewportTo jsfiddle example
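
A rough sketch of the header lookup idea, assuming `hot` is the Handsontable instance and `headers` is the array of column header labels. The helper names are hypothetical; only `scrollViewportTo(row, column)` is a Handsontable call.

```javascript
// Build a lookup from lower-cased header text to column index.
function buildHeaderIndex(headers) {
  const index = new Map();
  headers.forEach((header, col) => index.set(header.trim().toLowerCase(), col));
  return index;
}

// Scroll the grid to the column whose header matches `query` exactly
// (case-insensitive). Returns false when no header matches.
function jumpToColumn(hot, headerIndex, query) {
  const col = headerIndex.get(query.trim().toLowerCase());
  if (col === undefined) return false;
  hot.scrollViewportTo(0, col);
  return true;
}
```

The keys of the same map could also feed the pick-list that the user types into.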

CNPHI Export - .CSV option

The DataHarmonizer allows for export to the CNPHI template in .xls, but CNPHI only accepts .csv.
It would be very helpful to have the option to export to .csv, ideally as the CNPHI export default.

⭐ Thanks to Sarah Savić Kallesøe for identifying this issue!

data loss when importing with 1st row column headers

If the user imports a file where the headers are on the first row, and they declare so on the prompt (‘Which row in your file has the column headers?’ ‘1’) then the second row of the document (first row of data) is truncated (i.e. the row is gone, an empty row does not remain in place). This happens for all three file types (csv, tsv, xlsx).
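
A minimal sketch of the row arithmetic involved, assuming the file has already been parsed into an array of rows and that the user's answer is a 1-based row number. The function name is hypothetical, and the sketch ignores any separate "database identifiers" row.

```javascript
// `headerRow` is the 1-based row number the user typed into the prompt.
function splitHeaderAndData(rows, headerRow) {
  const headerIndex = headerRow - 1; // convert to a 0-based array index
  if (headerIndex < 0 || headerIndex >= rows.length) {
    throw new RangeError(`No row ${headerRow} in a file with ${rows.length} rows`);
  }
  // Data starts on the row immediately after the header row.
  return { headers: rows[headerIndex], data: rows.slice(headerIndex + 1) };
}
```

With headers declared on row 1, the first data row is row 2 and nothing is dropped; throwing on an out-of-range answer also gives the importer something to report to the user.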

Invalid Data Notification on Save/Export

Could have a notification for when a user tries to export/save a validated spreadsheet that has invalid fields. We don't want to prevent export, but this could be helpful for users who may have missed correcting a field by accident. Perhaps the notification could have a 'more info' option that would then list out the invalid cells.

Accession Prefix Validation

Implement validation on accessions that have controlled prefixes.

BioProject Accession - Prefix: PRJNA
BioSample Accession - Prefix: SAMN
SRA (run) Accession - Prefix: SRR
GISAID Accession - Prefix: EPI_ISL_

GenBank Accessions - Allowable prefixes for nucleotide direct submissions: U, AF, AY, DQ, EF, EU, FJ, GQ, GU, HM, HQ, JF, JN, JQ, JX, KC, KF, KJ, KM, KP, KR, KT, KU, KX, KY, MF, MG, MH, MK, MN, MT (source: https://www.ncbi.nlm.nih.gov/Sequin/acc.html)
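
A sketch of prefix validation using the prefixes listed above. The field keys and the function name are illustrative, and the assumption that each prefix is followed only by digits (plus an optional ".version" suffix for GenBank) is mine and should be checked against each database's accession format.

```javascript
const GENBANK_PREFIXES = [
  "U", "AF", "AY", "DQ", "EF", "EU", "FJ", "GQ", "GU", "HM", "HQ", "JF", "JN",
  "JQ", "JX", "KC", "KF", "KJ", "KM", "KP", "KR", "KT", "KU", "KX", "KY", "MF",
  "MG", "MH", "MK", "MN", "MT",
];

// Assumption: a prefix is followed by digits only; GenBank may carry ".N".
const ACCESSION_PATTERNS = new Map([
  ["BioProject Accession", /^PRJNA\d+$/],
  ["BioSample Accession", /^SAMN\d+$/],
  ["SRA Accession", /^SRR\d+$/],
  ["GISAID Accession", /^EPI_ISL_\d+$/],
  ["GenBank Accession", new RegExp(`^(?:${GENBANK_PREFIXES.join("|")})\\d+(?:\\.\\d+)?$`)],
]);

// Fields without a registered pattern are treated as valid.
function isValidAccession(field, value) {
  const pattern = ACCESSION_PATTERNS.get(field);
  return pattern ? pattern.test(String(value).trim()) : true;
}
```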

High Priority: Collection Date - level of precision

For incomplete collection dates (to year, or to month) we need a "Date Unit" field with values "Year", "Month" and "Day".
In the validation step, if a collection date only specifies a month or year, the Date Unit field will specify that. Then the DH should automate the filling in the rest of the missing date parts with "01" so that the date can be accepted by downstream programs that require year-month-day (YYYY-MM-DD).
In the export file for CNPHI, the Date Unit field should be called Precision. We'll map that once the DH adopts the changes above.
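
A minimal sketch of the padding step described above, assuming collection dates arrive as YYYY, YYYY-MM, or YYYY-MM-DD strings. The function name and the shape of the returned object are hypothetical.

```javascript
// Pad an incomplete ISO date with "01" and report its original precision.
// Returns null when the value is not in one of the three accepted forms.
function normalizeCollectionDate(value) {
  const match = /^(\d{4})(?:-(\d{2})(?:-(\d{2}))?)?$/.exec(String(value).trim());
  if (!match) return null;
  const [, year, month, day] = match;
  // The Date Unit records which parts the user actually supplied.
  const unit = day ? "Day" : month ? "Month" : "Year";
  return { date: `${year}-${month || "01"}-${day || "01"}`, unit };
}
```

The `unit` value would populate the Date Unit field (Precision in the CNPHI export), so downstream users can tell a padded "01" from a real one.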

Blank rows between data are not read

We currently ignore blank rows when reading data, which helps reduce the size of the data we're working with (there are sometimes a lot of trailing empty rows). But blank rows between non-blank rows are ignored too. They should not be.
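
One way to get this behaviour is to drop only the trailing blank rows and leave interior ones in place. A minimal sketch, assuming the data is an array of row arrays; the function names are hypothetical.

```javascript
// A row is blank when every cell is null, undefined, or whitespace only.
function isBlankRow(row) {
  return row.every(
    (cell) => cell === null || cell === undefined || String(cell).trim() === ""
  );
}

// Remove blank rows from the end of the data, keeping blank rows
// that sit between non-blank rows.
function trimTrailingBlankRows(rows) {
  let end = rows.length;
  while (end > 0 && isBlankRow(rows[end - 1])) end -= 1;
  return rows.slice(0, end);
}
```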

include some example datasets

Can we include some example files that people can use to make sure the validator behaves as it should? I think we should include a good Excel file, a good text file, and a file with some errors in it. Thoughts?

CNPHI Export Header Issues - sequencing centre, primary specimen id, #version

Issues with DataHarmonizer -> CNPHI Export Headers:

  • Does not include "Sequencing Centre" field which is mandatory for CNPHI upload.
  • CNPHI will not accept the field name "primary specimen identification number", it is expecting "primary specimen id".
  • CNPHI will not accept uploads that have the DataHarmonizer version information, e.g. "#validated using data harmonizer version 0.13.1". At least, not on the header row.

Failed Import Notification

When an import fails (e.g. someone is missing a row and then fails to correctly declare which row has the column headers) it would be nice to have a notification of failed import. Especially since a failed import doesn’t show a new/empty sheet if the user has already been using it to validate a previous one. Depending on the user's attention to detail, they may not recognize that they are still working on their previous upload and proceed to work/export as if they were on the expected import.

Maintain extra fields on upload

Potential issue: If we insert the extra fields close to the same index they were originally, the first header row would become inaccurate. Could highlight fields to indicate the first header has no bearing on them. Not a super clean solution.

Adding age bin 80-89 and 90+

Is it possible to add an age bin for 80-89 and another for 90+?

Our partners have indicated that this would add more granularity to the data given they test several individuals above 100 yo.

Provenance - validation info only outputting to 1 row

The provenance validation version information is only being output to one row. We would like to see it on all rows, so that if the data gets subset or merged (etc.) the version information will be present for all specimens.

image

CNPHI Export - Symptom Label Conversions

Some symptoms have to have their label converted for the CNPHI export, as CNPHI already has established, equivalent labels and does not wish to change them.

Data Harmonizer "Host (common name)" Label | CNPHI "Animal Type" Label
Cow | bovine
Pig | porcine

Data Harmonizer "Signs and Symptoms" Label | CNPHI "Symptoms" Label
Acute Respiratory Distress Syndrome | ARDS
Chills (sudden cold sensation) | Chills
Conjunctivitis (pink eye) | Conjunctivitis
Diarrhea (watery stool) | Diarrhea, watery
Encephalitis (brain inflammation) | Encephalitis
Fatigue (tiredness) | Fatigue
Fever (>=38°C) | Fever
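
The conversions listed above could be held in simple lookup maps applied at export time. This is only a sketch: the constant and function names are hypothetical, and labels without an entry pass through unchanged.

```javascript
// Label pairs copied from the lists above.
const CNPHI_ANIMAL_TYPE_LABELS = new Map([
  ["Cow", "bovine"],
  ["Pig", "porcine"],
]);

const CNPHI_SYMPTOM_LABELS = new Map([
  ["Acute Respiratory Distress Syndrome", "ARDS"],
  ["Chills (sudden cold sensation)", "Chills"],
  ["Conjunctivitis (pink eye)", "Conjunctivitis"],
  ["Diarrhea (watery stool)", "Diarrhea, watery"],
  ["Encephalitis (brain inflammation)", "Encephalitis"],
  ["Fatigue (tiredness)", "Fatigue"],
  ["Fever (>=38°C)", "Fever"],
]);

// Return the CNPHI label when one is defined, otherwise the original label.
function toCnphiLabel(labels, value) {
  return labels.has(value) ? labels.get(value) : value;
}
```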

Suggestion: Required Fields

Hi,

Considering that we want submitters to include as much information as possible beyond the required fields, perhaps we could amend the DH to show the purple optional columns alongside the yellow required ones when "Show required columns" is selected.

As it stands, showing just the required column headers may not be enabling or encouraging people to add in additional information.

Cheers,

Sarah

'Help'/'Support' Section in Validator

This may have been discussed in the past (my apologies if I touch on things that have already been addressed) but can we include a 'help'/'support' section that directs the users to:

  • The SOP
  • Where to make issue/term requests (either linking to GitHub Issues or to a curator email)

There is now a published version of the SOP that we can link to (it will update every time we edit the SOP gdoc), but perhaps we should also have a pdf copy that users can use if they are offline? The latter would certainly require more maintenance - we could include a note at the top that it may not be the latest version and to go to the published version if possible.

Organism Example Update

Noticed that the example for the organism field says "Severe acute respiratory coronavirus 2" but that is no longer an acceptable input. It appears to have been changed to "Severe acute respiratory syndrome coronavirus 2". I have updated the vocab sheet and the SOP example to the new label.

I think people might get very frustrated if we don't get this updated quickly because it's a field everyone has to use and there isn't any source telling them what it's actually supposed to be (from what I can tell).

CNPHI Export - "Animal Type" Issue.

CNPHI "Animal Type" field doesn't accept Humans.
Looks like we are mapping "host (common name)" to "Animal Type"; can we do so but omit "human"?

CNPHI Field Context:

Field: Specimen Source
Input: Human, Animal, Environment

Field: Animal Type
Input: Bat, Cat, Chicken, Civets, Cow, Dog, Lion, Pangolin, Pig, Pigeon, Tiger, Human
