
Biosample analysis

Repo for analysis of biosamples in INSDC

Questions to explore

  • Which attributes/properties are used?
  • Are these conformant to standards?
    • E.g., are MIxS fields used?
    • Do the range constraints apply?
  • Can we mine ontology terms (e.g., ENVO) from text descriptions?
  • Can we auto-populate metadata fields?

Workflow

See the Makefile for details.

Analysis Data

In addition to the data in the target directory, sample data that is too large for GitHub is stored on our Google Drive here.
Files include:

  • biosample_set.xml.gz
    This is the full raw biosample dataset formatted as XML.

  • harmonized-values-eav.tsv.gz
    A tab-delimited file containing data extracted from biosample_set.xml.gz: each biosample's primary ID and only those biosample attributes that have a harmonized_name property. The data is in entity-attribute-value (EAV) format; the columns are accession|attribute|value (accession is the accession number of the biosample).
    If necessary, use make target/harmonized-table.tsv to create the (non-zipped) file locally.

  • harmonized-table.tsv.gz
    A tab-delimited file in which the data from harmonized-values-eav.tsv.gz has been "pivoted" into a standard tabular format (i.e., the attributes are column headers). If necessary, use make harmonized-table.tsv to create the (non-zipped) file locally.

  • harmonized-attribute-value.ttl.gz
    A Turtle (RDF) file in which the data from harmonized-values-eav.tsv.gz has been transformed into sets of triples.
    If necessary, use make harmonized-attribute-value.ttl to create the (non-zipped) file locally.

  • harmonized-table.parquet.gz
    A parquet file containing the same contents as harmonized-table.tsv.gz. You will need to have pyarrow installed (i.e., pip install pyarrow). In pandas, you load it like this:

        import pandas as pds
        df = pds.read_parquet('harmonized-table.parquet.gz')
    If necessary, use make target/harmonized-table.parquet.gz to create the parquet file locally.
    Details of how to save the harmonized dataframe in parquet are found in save-harmonized-table-to-parquet.py.

  • harmonized_table.db.gz
    An sqlite database in which the biosample table contains the contents of harmonized-table.tsv.gz. Data is loaded into a pandas dataframe like this:

    import sqlite3
    import pandas as pds

    con = sqlite3.connect('harmonized_table.db') # connect to database
    df = pds.read_sql('select * from biosample limit 10', con) # test loading 10 records

    NB: Loading all records (i.e., df = pds.read_sql('select * from biosample', con)) is VERY time-consuming and memory-intensive. I gave up after letting the process run for 4 hours. If necessary, use make target/harmonized_table.db to create the (non-zipped) sqlite database locally.
    Details of how to save the harmonized dataframe in sqlite are found in save-harmonized-table-to-sqlite.py.
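If the full table is too large to materialize at once, pandas can stream it with the chunksize parameter of read_sql. A minimal sketch against a throwaway in-memory database (the biosample table name matches the description above; the rows are toy data, and for the real file you would connect to 'harmonized_table.db' instead):

```python
import sqlite3
import pandas as pds  # alias used in this repo's scripts

con = sqlite3.connect(':memory:')  # swap in 'harmonized_table.db' for the real data
con.executescript(
    "CREATE TABLE biosample (accession TEXT, depth TEXT);"
    "INSERT INTO biosample VALUES ('SAMN01', '30ft'), ('SAMN02', '0 m');"
)

# Stream the table in fixed-size chunks instead of materializing every row
n_rows = 0
for chunk in pds.read_sql('select * from biosample', con, chunksize=1):
    n_rows += len(chunk)  # replace with real per-chunk processing
print(n_rows)  # 2
```

Each chunk is an ordinary dataframe, so per-chunk processing keeps memory bounded regardless of table size.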

Related

https://github.com/cmungall/metadata_converter

https://academic.oup.com/database/article/doi/10.1093/database/bav126/2630130

Example bad data

Depth

MIxS specifies this should be {number} {unit}

Some example values that do not conform:

  • N40.1164_W88.2543
  • 25 santimeters
  • 0 – 20 cm
  • 3.149
  • 30-60cm replicate6
  • 1800, 1800
  • 30ft
  • 5m, 32m, 70m, 110m, 200m, 320m, 1000m
  • Surface soil from deep water
  • 0 m water depth
  • Metamorph4 (19dpf) biological replicate 3
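A minimal sketch of how nonconforming depth values could be flagged programmatically. The regex and the accepted unit list are illustrative assumptions, not MIxS's official grammar:

```python
import re

# Conformance check for a '{number} {unit}' value; the unit list is
# an illustrative assumption, not the official MIxS unit vocabulary
DEPTH_RE = re.compile(r'^[-+]?\d+(?:\.\d+)?\s(m|cm|mm)$')

def conforms(value: str) -> bool:
    return bool(DEPTH_RE.match(value.strip()))

print(conforms('30 m'))                # True
print(conforms('N40.1164_W88.2543'))   # False: coordinates, not a depth
print(conforms('3.149'))               # False: number with no unit
print(conforms('30ft'))                # False: no space, non-SI unit
```

A check like this could be run over the harmonized table to produce per-attribute conformance counts.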

pH

  • pH 7.9
  • 6.0-9.5
  • 8,156
  • NA1
  • 2.75 (orig)
  • 5.11±0.10
  • Missing: Not reported
  • Not collected
  • 7.0-7.5 um
  • Moderately alkaline

Note that missing values do not correspond to:

https://gensc.org/uncategorized/reporting-missing-values/

ammonium

Should be {float} {unit}

  • 0.71 micro molar
  • 14.941
  • -0.024
  • 1.9 g NH4-N L-1
  • Below the deteciton limit (2 microM)
  • 3.09µg/L

Units vary from 'micro molar' to 'uM' to 'mg/L'.

geo_loc_name

MIxS:

The geographical origin of the sample as defined by the country or sea name followed by specific region name. Country or sea names should be chosen from the INSDC country list (http://insdc.org/country.html), or the GAZ ontology (v 1.512) (http://purl.bioontology.org/ontology/GAZ)

{term};{term};{text}

  • USA: WA
  • USA:MO
  • USA: Boston, MA
  • USA:CA:Davis
  • United Kingdom: Midlands and East of England
  • Malawi: GAZ

biosample-analysis's People

Contributors

cmungall, hrshdhgd, realmarcin, turbomam, wdduncan


biosample-analysis's Issues

create one-hot encodings of nmdc biosample runNER output

The output of runNER against downloads/nmdc-gold-path-ner/nmdc-biosample-table-for-ner-20201016.tsv is stored in downloads/nmdc-gold-path-ner/runner.

Create a one-hot-encoded file of the named entities identified in runNER_Output.tsv.
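One possible sketch in pandas, assuming the NER output can be reduced to (accession, entity) pairs; the frame below is invented and the ENVO IDs are placeholders:

```python
import pandas as pd

# Hypothetical runNER output: one row per (sample, recognized entity) pair
ner = pd.DataFrame({
    'accession': ['SAMN01', 'SAMN01', 'SAMN02'],
    'entity':    ['ENVO:00000001', 'ENVO:00000002', 'ENVO:00000001'],
})

# One-hot encoding: one row per sample, one 0/1 column per named entity
onehot = pd.crosstab(ner['accession'], ner['entity']).clip(upper=1)
print(onehot)
```

crosstab counts occurrences; clipping at 1 turns repeated mentions into a binary indicator.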

Normalize unit representations

In the data we see many representations for units. E.g.,

7.0grams
7 g/L
7.0 grams per liter

We need to standardize these by:

  1. converting to the form {float} {unit}
  2. normalizing spellings and abbreviations

Also, as an add-on, we can add some conversion logic to get everything into the same units of measurement.
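A minimal sketch of such a normalizer; the regex and the spelling table are illustrative assumptions, not the project's actual mapping:

```python
import re

# Illustrative spelling/abbreviation table; the real mapping would be larger
UNIT_SPELLINGS = {'grams per liter': 'g/L', 'grams': 'g', 'g/l': 'g/L'}

def normalize(value: str):
    """Split a raw value like '7.0grams' into ({float}, {unit})."""
    m = re.match(r'\s*([-+]?\d+(?:\.\d+)?)\s*(.*)', value)
    if not m:
        return None  # no leading number: cannot normalize
    number = float(m.group(1))
    unit = m.group(2).strip().lower()
    return number, UNIT_SPELLINGS.get(unit, unit)

print(normalize('7.0grams'))             # (7.0, 'g')
print(normalize('7 g/L'))                # (7.0, 'g/L')
print(normalize('7.0 grams per liter'))  # (7.0, 'g/L')
```

Unit conversion (e.g., g/L to mg/L) could then be layered on top of the normalized (number, unit) pairs.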

cc @cmungall @realmarcin @hrshdhgd

Map biosample table fields to MIxS packages

The specific task is to create a mapping file with the list of MIxS packages containing each field, if any. Since the biosample table already conforms to the MIxS schema, most field names should match MIxS field names, with a small unmappable remainder.

Where did these folders come from?

@wdduncan @cmungall

I have gensc.github.io/ in the root directory/master branch of INCATools/biosample-analysis. I think it had to do with some integration we were doing with mixs.yaml.

Same thing with queries/Script-1.sql. It looks like a pure SQL implementation of what I had done in a notebook in the gold ontology repo. https://github.com/cmungall/gold-ontology/blob/main/notebooks/goldpaths_to_triads_by_proportion.ipynb

If they're considered "Untracked", could they have gotten into my file system from some other route than me creating them locally? Should I push them?

biosample-analysis % git status 
On branch master
Your branch is up to date with 'origin/master'.

Changes not staged for commit:
  (use "git add <file>..." to update what will be committed)
  (use "git restore <file>..." to discard changes in working directory)
	modified:   Makefile
	modified:   target/emp_studies.tsv
	modified:   util/save-harmonized-table-to-sqlite.py

Untracked files:
  (use "git add <file>..." to include in what will be committed)
	gensc.github.io/
	queries/

add additional biosample identifiers as columns

this depends on what's available in the XML file. Any of the EBI, INSDC, or SRA sample IDs would be useful for linking to other data.

along those lines, sometimes samples have children and parents, so adding the parent sample IDs would allow us to group/stratify the biosample table rows more precisely.

expected fields errors when reading full biosample tsv with pandas

When reading the full biosample tsv table with pandas like this:

df_biosample = pd.read_csv("harmonized-table.tsv", sep="\t")

This error pops up deep into the file:

ParserError: Error tokenizing data. C error: Expected 464 fields in line 4929258, saw 542

I've checked the offending line and its neighbors using awk to count tabs and they all have 463 tabs (hence 464 fields). I also looked through the fields in those lines and didn't see any odd characters, just the usual strings, identifiers separated by | and dates.

FWIW, the same error occurs when using skiprows=2, which is a suggested workaround for problematic headers (which shouldn't be the case here anyway).

It's a bit of a puzzle.
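One possible culprit when pandas's field count disagrees with awk's raw tab count is quote handling: a stray double quote makes the C parser absorb tabs and newlines into a quoted field, so its row boundaries diverge from the raw ones. Whether that is the actual cause here is unverified, but quoting=csv.QUOTE_NONE is cheap to try:

```python
import csv
import io
import pandas as pd

# A stray double quote can make the C parser treat following tabs/newlines
# as quoted text, so its field count diverges from awk's raw tab count
tsv = 'a\tb\tc\n1\t"x\t3\n'  # unbalanced quote in the data row

# QUOTE_NONE makes pandas split on raw tabs, exactly like awk does
df = pd.read_csv(io.StringIO(tsv), sep='\t', quoting=csv.QUOTE_NONE)
print(df.shape)  # (1, 3)
```

If the file parses cleanly with QUOTE_NONE, the offending row likely contains an unbalanced quote character somewhere earlier in the file.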

.gitignore additions in master

adding to .gitignore in master:

  • venv
  • target
    • Are there some files that we DO want to sync? I currently have ~ 12 files > 50 MB and have not even done all of the make steps
  • downloads
    • same question as above
  • .idea
    • from PyCharm?

get counts of distinct mixs triads

Find out how many of each distinct MIxS triad are in the harmonized table. For this, I am going to use the sqlite file harmonized_table.db. This will allow me to easily execute a SQL query against the biosample data.
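A sketch of the query via sqlite3, assuming the triad columns are named env_broad_scale, env_local_scale, and env_medium in the biosample table (the inserted rows are toy data; for the real run, connect to 'harmonized_table.db'):

```python
import sqlite3

con = sqlite3.connect(':memory:')  # swap in 'harmonized_table.db' for the real data
con.executescript(
    "CREATE TABLE biosample (env_broad_scale TEXT, env_local_scale TEXT, env_medium TEXT);"
    "INSERT INTO biosample VALUES"
    " ('terrestrial biome','tundra','soil'),"
    " ('terrestrial biome','tundra','soil'),"
    " ('marine biome','coast','sea water');"
)

# Count each distinct MIxS triad; the triad column names are assumptions
rows = con.execute(
    "SELECT env_broad_scale, env_local_scale, env_medium, count(*) AS n"
    " FROM biosample GROUP BY 1, 2, 3 ORDER BY n DESC"
).fetchall()
print(rows[0])  # ('terrestrial biome', 'tundra', 'soil', 2)
```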

Perform NER on sample metadata

Test NER on a large set of samples using text fields.
Fields to consider:

  • title
  • paragraph
  • taxonomy_name
  • env_package
  • ... add more in responses

Tools to use: OGER, SciGraph Annotator, others?
Need to test for plurals (e.g. deserts => desert).

cc @cmungall

normalize ENVO terms

These are mostly strings. Some do not correspond to a class label, e.g. 'tundra'.

There should be a repair step that gets the IDs. I suggest a denormalized/flattened schema where we append _id onto the field name, e.g. env_local_scale_id=ENVO:nnnn. In the NMDC/MIxS schema this is a compound object.
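A sketch of that flattened-schema repair step; the lookup table and the IDs below are placeholders, since the real label-to-ID mapping would come from ENVO itself:

```python
import pandas as pd

# Placeholder label→ID lookup; real IDs would come from ENVO
ENVO_IDS = {'tundra': 'ENVO:placeholder1', 'soil': 'ENVO:placeholder2'}

df = pd.DataFrame({'env_local_scale': ['tundra', 'soil', 'no such label']})
# Denormalized/flattened schema: append _id onto the field name
df['env_local_scale_id'] = df['env_local_scale'].map(ENVO_IDS)
print(df['env_local_scale_id'].tolist())
```

Strings that match no class label are left as NaN in the _id column, flagging them for the repair step.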

target/occurrences-%.tsv make recipe can't find tabs

This recipe searches for tab delimiters with this general pattern

egrep $'\t' target/attributes.tsv

@cmungall I guess that must work on your Mac, but it doesn't work in Ubuntu

An alternative that works with GNU grep is

grep -P '\t' target/attributes.tsv

@wdduncan This is what we were just discussing this morning

make harmonized_table.db: no sars_cov_2_diag_pcr_ct_value_2 column

% time make target/harmonized_table.db
python ./util/save-harmonized-table-to-sqlite.py target/harmonized-table.tsv target/harmonized_table.db
reading chunks
saving as sqlite3
Traceback (most recent call last):
File "./util/save-harmonized-table-to-sqlite.py", line 24, in
chunk.to_sql(name='biosample', con=con, if_exists='append', index=False)
File "/Users/MAM/Documents/gitrepos/biosample-analysis/venv/lib/python3.8/site-packages/pandas/core/generic.py", line 2602, in to_sql
sql.to_sql(
File "/Users/MAM/Documents/gitrepos/biosample-analysis/venv/lib/python3.8/site-packages/pandas/io/sql.py", line 589, in to_sql
pandas_sql.to_sql(
File "/Users/MAM/Documents/gitrepos/biosample-analysis/venv/lib/python3.8/site-packages/pandas/io/sql.py", line 1828, in to_sql
table.insert(chunksize, method)
File "/Users/MAM/Documents/gitrepos/biosample-analysis/venv/lib/python3.8/site-packages/pandas/io/sql.py", line 830, in insert
exec_insert(conn, keys, chunk_iter)
File "/Users/MAM/Documents/gitrepos/biosample-analysis/venv/lib/python3.8/site-packages/pandas/io/sql.py", line 1555, in _execute_insert
conn.executemany(self.insert_statement(num_rows=1), data_list)
sqlite3.OperationalError: table biosample has no column named sars_cov_2_diag_pcr_ct_value_2
make: *** [target/harmonized_table.db] Error 1
make target/harmonized_table.db 694.43s user 153.44s system 95% cpu 14:43.92 total

scoped_mapping_of_biosamples_mixs "value is trying to be set on a copy of a slice"

Rookie mistake. I thought I addressed this :-(

mapping_candidates.loc[:, "onto_prefix"] = pd.Series(prefix_when_possible).values

/Users/MAM/Documents/gitrepos/biosample-analysis/venv/lib/python3.8/site-packages/pandas/core/indexing.py:1596: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
self.obj[key] = _infer_fill_value(value)
/Users/MAM/Documents/gitrepos/biosample-analysis/venv/lib/python3.8/site-packages/pandas/core/indexing.py:1745: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
isetter(ilocs[0], value)

Additional fields to include

We need additional text fields for text-mining experiments, e.g.:

  <Description>
    <Title>Alistipes putredinis DSM 17216</Title>
    <Organism taxonomy_id="445970" taxonomy_name="Alistipes putredinis DSM 17216"/>
    <Comment>
      <Paragraph>Alistipes putredinis (GenBank Accession Number for 16S rDNA gene: L16497) is a member of the Bacteroidetes division of the domain bacteria and has been isolated from human feces. It has been found in 16S rDNA sequence-based enumerations of the colonic microbiota of adult humans (Eckburg et. al. (2005), Ley et. al. (2006)). </Paragraph>
      <Paragraph>Keywords: GSC:MIxS;MIGS:5.0</Paragraph>
    </Comment>
  </Description>

Add additional columns

  • title
  • organism_taxonomy_id
  • comment (concatenation of paragraphs)

Also we want to capture all external IDs

e.g.

    <Id db="SRA">SRS058998</Id>
    <Id db="BioSample" is_primary="1">SAMN00011046</Id>
    <Id db="GEO" db_label="Sample name">GSM531786</Id>
    <Id db="SRA">SRS058999</Id>
    <Id db="BioSample" is_primary="1">SAMN00011047</Id>
    <Id db="GEO" db_label="Sample name">GSM531787</Id>
    <Id db="SRA">SRS059000</Id>
    <Id db="BioSample" is_primary="1">SAMN00011048</Id>

I suggest either

  • one column, xrefs, with a value of a | concatenated set of CURIEs (e.g. SRA:SRS059000)
  • OR one column per prefix, single-valued
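A sketch of option 1 using only the standard library; wrapping the Id elements in an enclosing Ids element is an assumption about the surrounding XML:

```python
import xml.etree.ElementTree as ET

xml = '''<Ids>
  <Id db="SRA">SRS058998</Id>
  <Id db="BioSample" is_primary="1">SAMN00011046</Id>
  <Id db="GEO" db_label="Sample name">GSM531786</Id>
</Ids>'''

# Option 1: collapse all external IDs into one |-separated xrefs value,
# using each Id's db attribute as the CURIE prefix
ids = ET.fromstring(xml)
xrefs = '|'.join(f"{i.get('db')}:{i.text}" for i in ids.findall('Id'))
print(xrefs)  # SRA:SRS058998|BioSample:SAMN00011046|GEO:GSM531786
```

Option 2 would instead pivot these pairs into one single-valued column per db prefix.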

run NER/CR over all textual metadata fields

Execute runNER over

  • study table (see #1)
  • sample table

Run over all textual fields, in particular:

  • description fields
  • ENVO triad fields

Vocabularies: ENVO, CHEBI, NCBITaxon, specifically in text fields.

This can then be used to repair the TSV by inserting the correct identifier for the ENVO class, and also for prediction.

Perl modules required by Makefile recipes

make target/harmonized-attributes-only-eav.tsv

gzip -dc downloads/biosample_set.xml.gz | ./util/harmonized-attributes-only-eav.pl > target/harmonized-attributes-only-eav.tsv
Can't locate Text/Trim.pm in @INC (you may need to install the Text::Trim module) (@INC contains: /Library/Perl/5.18/darwin-thread-multi-2level /Library/Perl/5.18 /Network/Library/Perl/5.18/darwin-thread-multi-2level /Network/Library/Perl/5.18 /Library/Perl/Updates/5.18.4 /System/Library/Perl/5.18/darwin-thread-multi-2level /System/Library/Perl/5.18 /System/Library/Perl/Extras/5.18/darwin-thread-multi-2level /System/Library/Perl/Extras/5.18 .) at ./util/harmonized-attributes-only-eav.pl line 2.
BEGIN failed--compilation aborted at ./util/harmonized-attributes-only-eav.pl line 2.
make: *** [target/harmonized-attributes-only-eav.tsv] Error 2

perl -MCPAN -e'install Text::Trim' with "mostly automatic CPAN configuration" solved the problem for me.

I have been closing and restarting my shell after these CPAN installations. Is that really necessary?

Are all fields at the Study level inherited by Samples?

For example, Study object has a field

specific_ecosystem

which seems like a more specific term than some of the other ecosystem terms. But it is only present at the Study level, not the Sample level.

When flattening the NMDC Sample data, these Study fields would either go to a separate linked file for Study sets or would be pushed down to the Sample level.

explore TPOT package for more automated transform, feature and model selection

... this is for once we have a usable numeric matrix for analysis/ML.

The TPOT package is actually an amazing feat, I think; I wish they had made it sooner!
https://github.com/EpistasisLab/tpot

The notebook I committed:
https://github.com/realmarcin/biosample-analysis/blob/master/notebooks/first_analysis_notebook.ipynb

is a pipeline prototype going all the way from the original TSV (or some subset), to a numeric matrix passed to DecisionTree and viz. Lots more can be added, but this pipeline could already be plugged into TPOT.

Include local ontology's labels in human review spreadsheet

Notebook scoped_mapping_of_biosamples_mixs maps strings from MIxS triad columns in the INSDC BioSampleMetadata collection to OBO Foundry terms. Those mappings can currently come from ontologies that import terms and assign their own annotations.

The notebook creates a dataframe called best_and_salvage, which is saved to the clipboard, for human review in a spreadsheet

best_and_salvage should include the local ontology's label in addition to the importer's label.

target/%MIxS_columns.tsv target contains no %

Makefile:126: *** target pattern contains no `%'. Stop.

Line 126 starts with

target/%MIxS_columns.tsv: https://github.com/cmungall/mixs-source/tree/main/src/schema
# This notebook generates two files : MIxS_columns.tsv and Non_MIxS_columns.tsv.
# Highlights the data column names that are MIxS terms and non-MIxS terms
	jupyter nbconvert --execute --clear-output src/notebooks/MIxS_comparison.ipynb 

I commented that recipe out locally and can now run make downloads/emp.tsv, for example. (The colon in the https:// prerequisite likely makes make parse this line as a static pattern rule, whose target pattern contains no %.)

add make command to sync biosample files

The large files (e.g., harmonized_table.db) are stored on the Google Drive.

Add targets to the Makefile (e.g., sync-harmonized_table.db) that will download the data files from the Google Drive (e.g., using wget or curl).
