
Biosample analysis

Repo for analysis of biosamples in INSDC

Questions to explore

  • Which attributes/properties are used?
  • Are these conformant to standards?
    • E.g., are MIxS fields used?
    • Do the range constraints apply?
  • Can we mine ontology terms (e.g., ENVO) from text descriptions?
  • Can we auto-populate metadata fields?

Workflow

See the Makefile for details.

Analysis Data

In addition to the data in the target directory, sample data that is too large for GitHub is stored on our Google Drive here.
Files include:

  • biosample_set.xml.gz
    This is the full raw biosample dataset formatted as XML.

  • harmonized-values-eav.tsv.gz
    A tab-delimited file containing data extracted from biosample_set.xml.gz: each biosample's primary ID and only those biosample attributes that have a harmonized_name property. The data is in entity-attribute-value (EAV) format; the columns are accession|attribute|value (accession is the accession number of the biosample).
    If necessary, use make target/harmonized-table.tsv to create the (non-zipped) file locally.

  • harmonized-table.tsv.gz
    A tab-delimited file in which the data from harmonized-values-eav.tsv.gz has been "pivoted" into a standard tabular format (i.e., the attributes are column headers). If necessary, use make harmonized-table.tsv to create the (non-zipped) file locally.

  • harmonized-attribute-value.ttl.gz
    A Turtle (RDF) file in which the data from harmonized-values-eav.tsv.gz has been transformed into sets of triples.
    If necessary, use make harmonized-attribute-value.ttl to create the (non-zipped) file locally.

  • harmonized-table.parquet.gz
    A parquet file containing the same contents as harmonized-table.tsv.gz. You will need to have pyarrow installed (i.e., pip install pyarrow). In pandas, you load it like this:

        import pandas as pds
        df = pds.read_parquet('harmonized-table.parquet.gz')
    If necessary, use make target/harmonized-table.parquet.gz to create the parquet file locally.
    Details of how to save the harmonized dataframe in parquet are found in save-harmonized-table-to-parquet.py.

  • harmonized_table.db.gz
    An sqlite database in which the biosample table contains the contents of harmonized-table.tsv.gz. Data is loaded into a pandas dataframe like this:

    import sqlite3
    import pandas as pds

    con = sqlite3.connect('harmonized_table.db') # connect to database
    df = pds.read_sql('select * from biosample limit 10', con) # test loading 10 records

    NB: Loading all records (i.e., df = pds.read_sql('select * from biosample', con)) is VERY time-consuming and memory-intensive. I gave up after letting the process run for 4 hours. If necessary, use make target/harmonized_table.db to create the (non-zipped) sqlite database locally.
    Details of how to save the harmonized dataframe in sqlite are found in save-harmonized-table-to-sqlite.py.
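If the full table is too large to materialize at once, pandas can stream it with the chunksize parameter of read_sql. A minimal sketch against a throwaway in-memory database (the biosample table name matches the description above; the rows are toy data, and for the real file you would connect to 'harmonized_table.db' instead):

```python
import sqlite3
import pandas as pds  # alias used in this repo's scripts

con = sqlite3.connect(':memory:')  # swap in 'harmonized_table.db' for the real data
con.executescript(
    "CREATE TABLE biosample (accession TEXT, depth TEXT);"
    "INSERT INTO biosample VALUES ('SAMN01', '30ft'), ('SAMN02', '0 m');"
)

# Stream the table in fixed-size chunks instead of materializing every row
n_rows = 0
for chunk in pds.read_sql('select * from biosample', con, chunksize=1):
    n_rows += len(chunk)  # replace with real per-chunk processing
print(n_rows)  # 2
```

Each chunk is an ordinary dataframe, so per-chunk processing keeps memory bounded regardless of table size.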

Related

https://github.com/cmungall/metadata_converter

https://academic.oup.com/database/article/doi/10.1093/database/bav126/2630130

Example bad data

Depth

MIxS specifies this should be {number} {unit}

Some example values that do not conform:

  • N40.1164_W88.2543
  • 25 santimeters
  • 0 – 20 cm
  • 3.149
  • 30-60cm replicate6
  • 1800, 1800
  • 30ft
  • 5m, 32m, 70m, 110m, 200m, 320m, 1000m
  • Surface soil from deep water
  • 0 m water depth
  • Metamorph4 (19dpf) biological replicate 3
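A minimal sketch of how nonconforming depth values could be flagged programmatically. The regex and the accepted unit list are illustrative assumptions, not MIxS's official grammar:

```python
import re

# Conformance check for a '{number} {unit}' value; the unit list is
# an illustrative assumption, not the official MIxS unit vocabulary
DEPTH_RE = re.compile(r'^[-+]?\d+(?:\.\d+)?\s(m|cm|mm)$')

def conforms(value: str) -> bool:
    return bool(DEPTH_RE.match(value.strip()))

print(conforms('30 m'))                # True
print(conforms('N40.1164_W88.2543'))   # False: coordinates, not a depth
print(conforms('3.149'))               # False: number with no unit
print(conforms('30ft'))                # False: no space, non-SI unit
```

A check like this could be run over the harmonized table to produce per-attribute conformance counts.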

pH

  • pH 7.9
  • 6.0-9.5
  • 8,156
  • NA1
  • 2.75 (orig)
  • 5.11±0.10
  • Missing: Not reported
  • Not collected
  • 7.0-7.5 um
  • Moderately alkaline

Note that missing values do not correspond to:

https://gensc.org/uncategorized/reporting-missing-values/

ammonium

Should be {float} {unit}

  • 0.71 micro molar
  • 14.941
  • -0.024
  • 1.9 g NH4-N L-1
  • Below the deteciton limit (2 microM)
  • 3.09µg/L

Units vary from 'micro molar' to 'uM' to 'mg/L'.

geo_loc_name

MIxS:

The geographical origin of the sample as defined by the country or sea name followed by specific region name. Country or sea names should be chosen from the INSDC country list (http://insdc.org/country.html), or the GAZ ontology (v 1.512) (http://purl.bioontology.org/ontology/GAZ)

{term};{term};{text}

  • USA: WA
  • USA:MO
  • USA: Boston, MA
  • USA:CA:Davis
  • United Kingdom: Midlands and East of England
  • Malawi: GAZ

biosample-analysis's People

Contributors

cmungall, hrshdhgd, realmarcin, turbomam, wdduncan


biosample-analysis's Issues

create one-hot encodings of nmdc biosample runNER output

The output of runNER against downloads/nmdc-gold-path-ner/nmdc-biosample-table-for-ner-20201016.tsv is stored in downloads/nmdc-gold-path-ner/runner.

Create a one-hot-encoded file of the named entities identified in runNER_Output.tsv.
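One possible sketch in pandas, assuming the NER output can be reduced to (accession, entity) pairs; the frame below is invented and the ENVO IDs are placeholders:

```python
import pandas as pd

# Hypothetical runNER output: one row per (sample, recognized entity) pair
ner = pd.DataFrame({
    'accession': ['SAMN01', 'SAMN01', 'SAMN02'],
    'entity':    ['ENVO:00000001', 'ENVO:00000002', 'ENVO:00000001'],
})

# One-hot encoding: one row per sample, one 0/1 column per named entity
onehot = pd.crosstab(ner['accession'], ner['entity']).clip(upper=1)
print(onehot)
```

crosstab counts occurrences; clipping at 1 turns repeated mentions into a binary indicator.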

Normalize unit representations

In the data we see many representations for units. E.g.,

7.0grams
7 g/L
7.0 grams per liter

We need to standardize these by:

  1. converting to the form {float} {unit}
  2. normalizing spellings and abbreviations

Also, as an add-on, we can add some conversion logic to get everything into the same units of measurement.
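A minimal sketch of such a normalizer; the regex and the spelling table are illustrative assumptions, not the project's actual mapping:

```python
import re

# Illustrative spelling/abbreviation table; the real mapping would be larger
UNIT_SPELLINGS = {'grams per liter': 'g/L', 'grams': 'g', 'g/l': 'g/L'}

def normalize(value: str):
    """Split a raw value like '7.0grams' into ({float}, {unit})."""
    m = re.match(r'\s*([-+]?\d+(?:\.\d+)?)\s*(.*)', value)
    if not m:
        return None  # no leading number: cannot normalize
    number = float(m.group(1))
    unit = m.group(2).strip().lower()
    return number, UNIT_SPELLINGS.get(unit, unit)

print(normalize('7.0grams'))             # (7.0, 'g')
print(normalize('7 g/L'))                # (7.0, 'g/L')
print(normalize('7.0 grams per liter'))  # (7.0, 'g/L')
```

Unit conversion (e.g., g/L to mg/L) could then be layered on top of the normalized (number, unit) pairs.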

cc @cmungall @realmarcin @hrshdhgd

Map biosample table fields to MIxS packages

The specific task is to create a mapping file with the list of MIxS packages containing each field, if any. Since the biosample table already conforms to the MIxS schema, most field names should match MIxS field names, with a small unmappable remainder.

Where did these folders come from?

@wdduncan @cmungall

I have gensc.github.io/ in the root directory/master branch of INCATools/biosample-analysis. I think it had to do with some integration we were doing with mixs.yaml.

Same thing with queries/Script-1.sql. It looks like a pure SQL implementation of what I had done in a notebook in the gold ontology repo. https://github.com/cmungall/gold-ontology/blob/main/notebooks/goldpaths_to_triads_by_proportion.ipynb

If they're considered "Untracked", could they have gotten into my file system from some other route than me creating them locally? Should I push them?

biosample-analysis % git status 
On branch master
Your branch is up to date with 'origin/master'.

Changes not staged for commit:
  (use "git add <file>..." to update what will be committed)
  (use "git restore <file>..." to discard changes in working directory)
	modified:   Makefile
	modified:   target/emp_studies.tsv
	modified:   util/save-harmonized-table-to-sqlite.py

Untracked files:
  (use "git add <file>..." to include in what will be committed)
	gensc.github.io/
	queries/

add additional biosample identifiers as columns

this depends on what's available in the XML file. Any of the EBI, INSDC, or SRA sample IDs would be useful for linking to other data.

along those lines, sometimes samples have children and parents, so adding the parent sample IDs would allow us to group/stratify the biosample table rows more precisely.

expected fields errors when reading full biosample tsv with pandas

When reading the full biosample tsv table with pandas like this:

df_biosample = pd.read_csv("harmonized-table.tsv", sep="\t")

This error pops up deep into the file:

ParserError: Error tokenizing data. C error: Expected 464 fields in line 4929258, saw 542

I've checked the offending line and its neighbors using awk to count tabs and they all have 463 tabs (hence 464 fields). I also looked through the fields in those lines and didn't see any odd characters, just the usual strings, identifiers separated by | and dates.

FWIW, the same error occurs when using skiprows=2, which is a suggested workaround for problematic headers (which shouldn't be the case here anyway).

It's a bit of a puzzle.
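One possible culprit when pandas's field count disagrees with awk's raw tab count is quote handling: a stray double quote makes the C parser absorb tabs and newlines into a quoted field, so its row boundaries diverge from the raw ones. Whether that is the actual cause here is unverified, but quoting=csv.QUOTE_NONE is cheap to try:

```python
import csv
import io
import pandas as pd

# A stray double quote can make the C parser treat following tabs/newlines
# as quoted text, so its field count diverges from awk's raw tab count
tsv = 'a\tb\tc\n1\t"x\t3\n'  # unbalanced quote in the data row

# QUOTE_NONE makes pandas split on raw tabs, exactly like awk does
df = pd.read_csv(io.StringIO(tsv), sep='\t', quoting=csv.QUOTE_NONE)
print(df.shape)  # (1, 3)
```

If the file parses cleanly with QUOTE_NONE, the offending row likely contains an unbalanced quote character somewhere earlier in the file.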

.gitignore additions in master

adding to .gitignore in master:

  • venv
  • target
    • Are there some files that we DO want to sync? I currently have ~ 12 files > 50 MB and have not even done all of the make steps
  • downloads
    • same question as above
  • .idea
    • from PyCharm?

get counts of distinct mixs triads

Find out how many of each distinct MIxS triad are in the harmonized table. For this, I am going to use the sqlite file harmonized_table.db. This will allow me to easily execute a SQL query against the biosample data.
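A sketch of the query via sqlite3, assuming the triad columns are named env_broad_scale, env_local_scale, and env_medium in the biosample table (the inserted rows are toy data; for the real run, connect to 'harmonized_table.db'):

```python
import sqlite3

con = sqlite3.connect(':memory:')  # swap in 'harmonized_table.db' for the real data
con.executescript(
    "CREATE TABLE biosample (env_broad_scale TEXT, env_local_scale TEXT, env_medium TEXT);"
    "INSERT INTO biosample VALUES"
    " ('terrestrial biome','tundra','soil'),"
    " ('terrestrial biome','tundra','soil'),"
    " ('marine biome','coast','sea water');"
)

# Count each distinct MIxS triad; the triad column names are assumptions
rows = con.execute(
    "SELECT env_broad_scale, env_local_scale, env_medium, count(*) AS n"
    " FROM biosample GROUP BY 1, 2, 3 ORDER BY n DESC"
).fetchall()
print(rows[0])  # ('terrestrial biome', 'tundra', 'soil', 2)
```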

Perform NER on sample metadata

Test NER on a large set of samples using text fields.
Fields to consider:

  • title
  • paragraph
  • taxonomy_name
  • env_package
  • ... add more in responses

Tools to use: OGER, SciGraph Annotator, others?
Need to test for plurals (e.g. deserts => desert).

cc @cmungall

normalize ENVO terms

These are mostly strings. Some do not correspond to a class label, e.g. 'tundra'.

There should be a repair step that gets the IDs. I suggest a denormalized/flattened schema where we append _id onto the field name, e.g. env_local_scale_id=ENVO:nnnn. In the NMDC/MIxS schema this is a compound object.
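A sketch of that flattened-schema repair step; the lookup table and the IDs below are placeholders, since the real label-to-ID mapping would come from ENVO itself:

```python
import pandas as pd

# Placeholder label→ID lookup; real IDs would come from ENVO
ENVO_IDS = {'tundra': 'ENVO:placeholder1', 'soil': 'ENVO:placeholder2'}

df = pd.DataFrame({'env_local_scale': ['tundra', 'soil', 'no such label']})
# Denormalized/flattened schema: append _id onto the field name
df['env_local_scale_id'] = df['env_local_scale'].map(ENVO_IDS)
print(df['env_local_scale_id'].tolist())
```

Strings that match no class label are left as NaN in the _id column, flagging them for the repair step.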

target/occurrences-%.tsv make recipe can't find tabs

This recipe searches for tab delimiters with this general pattern

egrep $'\t' target/attributes.tsv

@cmungall I guess that must work on your Mac, but it doesn't work in Ubuntu

An alternative that works with GNU grep is

grep -P '\t' target/attributes.tsv

@wdduncan This is what we were just discussing this morning

make harmonized_table.db: no sars_cov_2_diag_pcr_ct_value_2 column

% time make target/harmonized_table.db
python ./util/save-harmonized-table-to-sqlite.py target/harmonized-table.tsv target/harmonized_table.db
reading chunks
saving as sqlite3
Traceback (most recent call last):
File "./util/save-harmonized-table-to-sqlite.py", line 24, in
chunk.to_sql(name='biosample', con=con, if_exists='append', index=False)
File "/Users/MAM/Documents/gitrepos/biosample-analysis/venv/lib/python3.8/site-packages/pandas/core/generic.py", line 2602, in to_sql
sql.to_sql(
File "/Users/MAM/Documents/gitrepos/biosample-analysis/venv/lib/python3.8/site-packages/pandas/io/sql.py", line 589, in to_sql
pandas_sql.to_sql(
File "/Users/MAM/Documents/gitrepos/biosample-analysis/venv/lib/python3.8/site-packages/pandas/io/sql.py", line 1828, in to_sql
table.insert(chunksize, method)
File "/Users/MAM/Documents/gitrepos/biosample-analysis/venv/lib/python3.8/site-packages/pandas/io/sql.py", line 830, in insert
exec_insert(conn, keys, chunk_iter)
File "/Users/MAM/Documents/gitrepos/biosample-analysis/venv/lib/python3.8/site-packages/pandas/io/sql.py", line 1555, in _execute_insert
conn.executemany(self.insert_statement(num_rows=1), data_list)
sqlite3.OperationalError: table biosample has no column named sars_cov_2_diag_pcr_ct_value_2
make: *** [target/harmonized_table.db] Error 1
make target/harmonized_table.db 694.43s user 153.44s system 95% cpu 14:43.92 total

scoped_mapping_of_biosamples_mixs "value is trying to be set on a copy of a slice"

Rookie mistake. I thought I addressed this :-(

mapping_candidates.loc[:, "onto_prefix"] = pd.Series(prefix_when_possible).values

/Users/MAM/Documents/gitrepos/biosample-analysis/venv/lib/python3.8/site-packages/pandas/core/indexing.py:1596: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
self.obj[key] = _infer_fill_value(value)
/Users/MAM/Documents/gitrepos/biosample-analysis/venv/lib/python3.8/site-packages/pandas/core/indexing.py:1745: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
isetter(ilocs[0], value)

Additional fields to include

We need additional text fields for text-mining experiments, e.g.:

  <Description>
    <Title>Alistipes putredinis DSM 17216</Title>
    <Organism taxonomy_id="445970" taxonomy_name="Alistipes putredinis DSM 17216"/>
    <Comment>
      <Paragraph>Alistipes putredinis (GenBank Accession Number for 16S rDNA gene: L16497) is a member of the Bacteroidetes division of the domain bacteria and has been isolated from human feces. It has been found in 16S rDNA sequence-based enumerations of the colonic microbiota of adult humans (Eckburg et. al. (2005), Ley et. al. (2006)). </Paragraph>
      <Paragraph>Keywords: GSC:MIxS;MIGS:5.0</Paragraph>
    </Comment>
  </Description>

Add additional columns

  • title
  • organism_taxonomy_id
  • comment (concatenation of paragraphs)

Also we want to capture all external IDs

e.g.

    <Id db="SRA">SRS058998</Id>
    <Id db="BioSample" is_primary="1">SAMN00011046</Id>
    <Id db="GEO" db_label="Sample name">GSM531786</Id>
    <Id db="SRA">SRS058999</Id>
    <Id db="BioSample" is_primary="1">SAMN00011047</Id>
    <Id db="GEO" db_label="Sample name">GSM531787</Id>
    <Id db="SRA">SRS059000</Id>
    <Id db="BioSample" is_primary="1">SAMN00011048</Id>

I suggest either

  • one column, xrefs, with a value of a | concatenated set of CURIEs (e.g. SRA:SRS059000)
  • OR one column per prefix, single-valued
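A sketch of option 1 using only the standard library; wrapping the Id elements in an enclosing Ids element is an assumption about the surrounding XML:

```python
import xml.etree.ElementTree as ET

xml = '''<Ids>
  <Id db="SRA">SRS058998</Id>
  <Id db="BioSample" is_primary="1">SAMN00011046</Id>
  <Id db="GEO" db_label="Sample name">GSM531786</Id>
</Ids>'''

# Option 1: collapse all external IDs into one |-separated xrefs value,
# using each Id's db attribute as the CURIE prefix
ids = ET.fromstring(xml)
xrefs = '|'.join(f"{i.get('db')}:{i.text}" for i in ids.findall('Id'))
print(xrefs)  # SRA:SRS058998|BioSample:SAMN00011046|GEO:GSM531786
```

Option 2 would instead pivot these pairs into one single-valued column per db prefix.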

run NER/CR over all textual metadata fields

Execute runNER over

  • study table (see #1)
  • sample table

Run over all textual fields, in particular:

  • description fields
  • ENVO triad fields

Vocabularies: ENVO, CHEBI, NCBITaxon, specifically in text fields.

This can then be used to repair the TSV by inserting the correct identifier for the ENVO class, and also for prediction.

Perl modules required by Makefile recipes

make target/harmonized-attributes-only-eav.tsv

gzip -dc downloads/biosample_set.xml.gz | ./util/harmonized-attributes-only-eav.pl > target/harmonized-attributes-only-eav.tsv
Can't locate Text/Trim.pm in @INC (you may need to install the Text::Trim module) (@INC contains: /Library/Perl/5.18/darwin-thread-multi-2level /Library/Perl/5.18 /Network/Library/Perl/5.18/darwin-thread-multi-2level /Network/Library/Perl/5.18 /Library/Perl/Updates/5.18.4 /System/Library/Perl/5.18/darwin-thread-multi-2level /System/Library/Perl/5.18 /System/Library/Perl/Extras/5.18/darwin-thread-multi-2level /System/Library/Perl/Extras/5.18 .) at ./util/harmonized-attributes-only-eav.pl line 2.
BEGIN failed--compilation aborted at ./util/harmonized-attributes-only-eav.pl line 2.
make: *** [target/harmonized-attributes-only-eav.tsv] Error 2

perl -MCPAN -e'install Text::Trim' with "mostly automatic CPAN configuration" solved the problem for me.

I have been closing and restarting my shell after these CPAN installations. Is that really necessary?

Are all fields at the Study level inherited by Samples?

For example, Study object has a field

specific_ecosystem

which seems like a more specific term than some of the other ecosystem terms. But it is only present at the Study level, not the Sample level.

When flattening the NMDC Sample data, these Study fields would either go to a separate linked file for Study sets or would be pushed down to the Sample level.

explore TPOT package for more automated transform, feature and model selection

... this is for once we have a usable numeric matrix for analysis/ML.

The TPOT package is actually an amazing feat, I think; I wish they had made it sooner!
https://github.com/EpistasisLab/tpot

The notebook I committed:
https://github.com/realmarcin/biosample-analysis/blob/master/notebooks/first_analysis_notebook.ipynb

is a pipeline prototype going all the way from the original TSV (or some subset), to a numeric matrix passed to DecisionTree and viz. Lots more can be added, but this pipeline could already be plugged into TPOT.

Include local ontology's labels in human review spreadsheet

Notebook scoped_mapping_of_biosamples_mixs maps strings from MIxS triad columns in the INSDC BioSampleMetadata collection to OBO Foundry terms. Those mappings can currently come from ontologies that import terms and assign their own annotations.

The notebook creates a dataframe called best_and_salvage, which is saved to the clipboard, for human review in a spreadsheet

best_and_salvage should include the local ontology's label in addition to the importer's label.

target/%MIxS_columns.tsv target contains no %

Makefile:126: *** target pattern contains no `%'. Stop.

Line 126 starts with

target/%MIxS_columns.tsv: https://github.com/cmungall/mixs-source/tree/main/src/schema
# This notebook generates two files : MIxS_columns.tsv and Non_MIxS_columns.tsv.
# Highlights the data column names that are MIxS terms and non-MIxS terms
	jupyter nbconvert --execute --clear-output src/notebooks/MIxS_comparison.ipynb 

I commented that recipe out locally and can now run make downloads/emp.tsv, for example. (The colon in the https:// prerequisite likely makes make parse this line as a static pattern rule, whose target pattern contains no %.)

add make command to sync biosample files

The large files (e.g., harmonized_table.db) are stored on the Google Drive.

Add targets to the Makefile (e.g., sync-harmonized_table.db) that will download the data files from the Google Drive (e.g., using wget or curl).
