lazear / sage

Proteomics search & quantification so fast that it feels like magic

Home Page: https://sage-docs.vercel.app

License: MIT License

Rust 99.91% Dockerfile 0.09%
bioinformatics proteomics mass-spectrometry

sage's Introduction

Sage: proteomics searching so fast it seems like magic


For more information please read the online documentation!

Introduction

Sage is, at its core, a proteomics database search engine - a tool that transforms raw mass spectra from proteomics experiments into peptide identifications via database searching & spectral matching.

However, Sage includes a variety of advanced features that make it a one-stop shop: retention time prediction, quantification (both isobaric & LFQ), peptide-spectrum match rescoring, and FDR control. You can directly use results from Sage without needing to use other tools for these tasks.

Additionally, Sage was designed with cloud computing in mind - massively parallel processing and the ability to directly stream compressed mass spectrometry data to/from AWS S3 enables unprecedented search speeds with minimal cost.

Sage also runs just as well reading local files from your Mac/PC/Linux device!

Why use Sage instead of other tools?

Sage is simple to configure, powerful, and flexible. It also happens to be well-tested, mind-bogglingly fast, open-source (MIT-licensed), and free.

Citation

If you use Sage in a scientific publication, please cite the following paper:

Sage: An Open-Source Tool for Fast Proteomics Searching and Quantification at Scale

Features

Interoperability

Sage is well-integrated into the open-source proteomics ecosystem. The following projects support analyzing results from Sage (typically in addition to other tools), or redistribute Sage binaries for use in their pipelines.

  • SearchGUI: a graphical user interface for running searches
  • PeptideShaker: visualize peptide-spectrum matches
  • MS2Rescore: AI-assisted rescoring of results
  • Picked group FDR: scalable protein group FDR for large-scale experiments
  • sagepy: Python bindings to the sage-core library
  • quantms: nextflow pipeline for running searches with Sage
  • OpenMS: Sage is included as a "TOPP" tool in OpenMS
  • sager: R package for analyzing results from Sage searches
  • Sage results to mzIdentML: Bash script to convert results.sage.tsv files to mzIdentML
  • If your project supports Sage and it's not listed, please open a pull request! If you need help integrating or interfacing with Sage in some way, please reach out.

Check out the (now outdated) blog post introducing the first version of Sage for more information and full benchmarks!

sage's People

Contributors

elendol, friedlabjhu, hbarsnes, jspaezp, lazear, lgatto, matthewthe, ralfg, sander-willems-bruker, thegreatherrlebert, vijay-gnanasambandan-bruker, wfondrie, yujinjeong2


sage's Issues

Option to search for optional modifications

Hi,

First of all, thanks for implementing this algorithm and making it open source!
I'm aware that this tool is not finished (at least from what I understood from the blog post) and that not all features are implemented yet.
I tried the tool and it looks very promising; however, I was not able to find an option to specify optional modifications.
Is this something you are working on or planning to implement in the future, or am I just missing how to do it?

Best,
Manuel

Naming inconsistency in output files / documentation

Hi @lazear,

First and foremost, thanks for the great work!

Continuing the conversation from compomics/psm_utils#31: It seems that internally, the _q suffix is indeed used as is also documented for the result.sage.tsv output:
https://github.com/lazear/sage/blob/5f95d454f9b126cf93b2ec96443f4aaa8a88f588/crates/sage-cli/src/output.rs#L62C21-L64

However, the header names use the _fdr suffix:
https://github.com/lazear/sage/blob/f55a9e525cf353de94e194428a3a5615d0cbab8c/crates/sage-cli/src/output.rs#L113C18-L115

Personally, I think the _q suffix makes more sense, but for backwards compatibility (which might not be so much of an issue yet), you could opt to keep the _fdr suffix.

How to deal with DIA data?

Hi Lazear,

Sage is great software. I am wondering whether Sage can handle DIA data. Is there some detailed introduction about that?
Many thanks.

Bests,
Shisheng

Support compiling to WebAssembly

Heya! This looks like a sick project!

As part of my PhD I'll be spending some time improving the tools available for peptidoglycomics (https://elifesciences.org/articles/70597), and I'll have 4ish months to dedicate full time to that. While I'll have to write a fragment predictor on my own, sage could be an outstanding engine for actually finding the fragments I predict!

Long term though, I plan on building a Web GUI and either running the Rust backend natively through Tauri (https://tauri.app/) or, even better (for accessibility), on WebAssembly. If all of the Rust compiled down to WASM, I could just host a free Github pages website and serve a no-install GUI to anyone interested in using our tool!

While Rust's support for WebAssembly is generally outstanding, some low-level APIs can't be compiled to the WASM platform! Doing some preliminary testing with sage (cargo build --target wasm32-unknown-unknown), I get errors from Mio (https://github.com/tokio-rs/mio) which is pulled in by Tokio and ultimately sage-cloudpath.

I'm not sure if there are other issues lurking, but to be honest, the cloudpath support (if I understand correctly that that's what AWS support depends on) seems like an appropriate bit of functionality to put behind a feature flag! That way we can keep AWS support in by default (and my Tauri app could even use it), but I could also disable the feature in my Cargo.toml to compile everything down to WebAssembly!

If you have the time to play with things so that they compile to WebAssembly, that would be great, otherwise I'll consider this a note to myself and something to work on in a few months' time!

Thanks again for the outstanding tool and super helpful blog post about it!

Protein inference

Hi,

Does Sage include an option for protein inference from the PSMs? If not, are the output files (either .tsv or .pin) compatible with any good tools that can infer proteins from the PSMs, or do you have a preferred way of doing this?

Many thanks for developing Sage - it's a great tool!

Memory usage

Is there a parameter to reduce the memory usage when the database is very large?

Output directory

How do you specify the output directory in the .json file? I defined the path as shown below but it seems that the results are not saved in the directory I specified and I cannot locate them. Is there a default path where I could retrieve them?

"output_paths": [ "/share/mlevasseur/bruker/scp/processed/sage/20230314/" ]
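For reference, other configs in these issue reports control the output location with an output_directory key pointing at a folder, while output_paths (where supported) lists individual files. A hedged sketch of the directory-based form, assuming a Sage version that supports output_directory:

```json
"output_directory": "/share/mlevasseur/bruker/scp/processed/sage/20230314"
```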

Multiple modifications to same residue

Hi Michael,

Is it possible to add several variable modifications to the same site? When I tried including the following modifications in my config.json file:

"variable_mods": {      
      "M": 15.9949,         
      "[": 42.0,            
      "[": 227.98237,
      "[": 331.04570,
      "K": 227.98237,
      "K": 331.04570
    }

only the following modifications seemed to be used:

 "variable_mods": {
      "M": 15.9949,
      "[": 331.0457,
      "K": 331.0457
    }
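The likely cause is that a JSON object cannot hold duplicate keys: a parser keeps only one value per key (typically the last one seen), which matches the collapsed set of modifications observed above. Newer config files shown in these issues use a list of masses per residue under variable_mods; a hedged sketch of that form, assuming a Sage version with list-valued mods:

```json
"variable_mods": {
  "M": [15.9949],
  "[": [42.0, 227.98237, 331.04570],
  "K": [227.98237, 331.04570]
}
```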

Running sage output in percolator

Hi,
thanks for your great tool and for making it open-source!
I just ran into a small issue. If I call percolator on the sage output I get an error:

Started Mon Mar 20 10:45:47 2023
Hyperparameters: selectionFdr=0.01, Cpos=0, Cneg=0, maxNiter=10
Reading tab-delimited input from datafile results.sage.tsv

WARNING: Tab delimited input does not contain ScanNr column,
         scan numbers will be assigned automatically.

Features:
num_proteins
Exception caught: ERROR: Reading tab file, error reading PSM on line 2. Could not read label.

Could it be that percolator is case-sensitive and doesn't recognize "scannr"? Probably the same for other columns?
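If case sensitivity is indeed the problem, one workaround is to rewrite the header line before handing the file to Percolator. A minimal std-only sketch; the rename map below is illustrative, not a complete list of the column names Percolator expects:

```rust
use std::collections::HashMap;

/// Rewrite a tab-delimited header line so that lowercase Sage column
/// names match the capitalization Percolator appears to expect.
/// The mapping is an illustrative assumption, not an exhaustive list.
fn fix_header(header: &str) -> String {
    let rename: HashMap<&str, &str> =
        [("specid", "SpecId"), ("label", "Label"), ("scannr", "ScanNr")]
            .into_iter()
            .collect();
    header
        .split('\t')
        .map(|col| *rename.get(col).unwrap_or(&col))
        .collect::<Vec<_>>()
        .join("\t")
}

fn main() {
    let fixed = fix_header("specid\tlabel\tscannr\tpeptide");
    assert_eq!(fixed, "SpecId\tLabel\tScanNr\tpeptide");
    println!("{fixed}");
}
```

Running this over only the first line of results.sage.tsv and leaving the PSM rows untouched would be enough for this particular error.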

Problem analysing tims-tof data

Hello,

I currently want to test sage and compare it to other software tools but cannot get it to produce results. Error.txt is the detailed output I get when running sage. As input I use tims-tof (.d)-converted .mzML files, which I thought might be the issue. Hence I tried different methods of generating these:

  1. I used the guide as described at the bottom of this page.
  2. I first converted the .d folders to .mgf with the alphatims package and then again used the msconvert GUI, but with no changes to the default parameters (in contrast to the guide). This yielded a much smaller mzML file (500 MB vs. 5 GB for the "same" mzML files in 1.), so there are definitely some changes.

This is the fasta I use: NIST 8671_2021-11-22.txt
Could it be that I just have too few proteins in the fasta? Otherwise I have not changed any of the default parameters of the results.json except the path variables of the fasta and output paths.

I hope you can help me get more meaningful results and thank you in advance!!

Tom

Parsing mismatch of last protein sequence in fasta.rs

We get the following error when using one of our fasta files:

thread 'main' panicked at 'called `Option::unwrap()` on a `None` value', src/fasta.rs:41:57

We seem to have fixed it by updating line 41 of fasta.rs:

let acc: String = last_id.split('|').nth(1).unwrap().into();

to match the same parsing rule as in line 27:

let acc: String = last_id.split_ascii_whitespace().next().unwrap().into();
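The two parsing rules behave differently on non-UniProt headers, which is why the whitespace-based rule is the safer default. A small std-only illustration of the difference (the header strings are made up):

```rust
/// Two ways an accession can be pulled from a FASTA header line.
/// UniProt-style headers look like "sp|P12345|NAME_HUMAN description".

// Pipe-based rule: assumes at least two '|'-separated fields.
// Calling `.unwrap()` on the result is what panicked above.
fn acc_pipe(id: &str) -> Option<&str> {
    id.split('|').nth(1)
}

// Whitespace-based rule: take everything up to the first whitespace;
// works for any non-empty header.
fn acc_whitespace(id: &str) -> Option<&str> {
    id.split_ascii_whitespace().next()
}

fn main() {
    let uniprot = "sp|P12345|NAME_HUMAN some description";
    assert_eq!(acc_pipe(uniprot), Some("P12345"));
    assert_eq!(acc_whitespace(uniprot), Some("sp|P12345|NAME_HUMAN"));

    // A plain header has no '|', so nth(1) is None.
    let plain = "MyProtein some description";
    assert_eq!(acc_pipe(plain), None);
    assert_eq!(acc_whitespace(plain), Some("MyProtein"));
}
```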

Question: no search results on small mzMLs

Hi, for CI/testing purposes I ran a search on an unfiltered mzML, then extracted an identified spectrum; now I don't get any IDs.
Is this intended, i.e. a result of internal FDR filtering? If so, can the filtering be disabled?
Best,
Timo

{
  "version": "0.13.3",
  "database": {
    "bucket_size": 32768,
    "enzyme": {
      "missed_cleavages": 2,
      "min_len": 5,
      "max_len": 50,
      "cleave_at": "KR",
      "restrict": "P",
      "c_terminal": true
    },
    "fragment_min_mz": 200.0,
    "fragment_max_mz": 2000.0,
    "peptide_min_mass": 500.0,
    "peptide_max_mass": 5000.0,
    "ion_kinds": [
      "b",
      "y"
    ],
    "min_ion_index": 2,
    "static_mods": {
      "C": 57.021465
    },
    "variable_mods": {
      "M": [
        15.994915
      ]
    },
    "max_variable_mods": 2,
    "decoy_tag": "DECOY_",
    "generate_decoys": false,
    "fasta": "iPRG2015_decoy.fasta"
  },
  "quant": {
    "tmt": null,
    "tmt_settings": {
      "level": 3,
      "sn": false
    },
    "lfq": false,
    "lfq_settings": {
      "peak_scoring": "Hybrid",
      "integration": "Sum",
      "spectral_angle": 0.7,
      "ppm_tolerance": 5.0
    }
  },
  "precursor_tol": {
    "ppm": [
      -6.0,
      6.0
    ]
  },
  "fragment_tol": {
    "ppm": [
      -20.0,
      20.0
    ]
  },
  "isotope_errors": [
    -1,
    3
  ],
  "deisotope": false,
  "chimera": false,
  "wide_window": false,
  "min_peaks": 15,
  "max_peaks": 150,
  "max_fragment_charge": null,
  "min_matched_peaks": 6,
  "report_psms": 1,
  "predict_rt": false,
  "mzml_paths": [
    "SageAdapter_1.mzML"
  ],
  "output_paths": [
    "/home/sachsenb/results.sage.pin",
    "/home/sachsenb/results.sage.tsv",
    "/home/sachsenb/results.json"
  ]
}

System cannot find raw file

Hello Michael,

I ran into a problem in that the raw file cannot be read. I am using Windows 10.

In CMD I have

  • cd into the sage folder
  • moved the followings into the sage folder: my.fasta, test.json, LQSRPAAPPAPGPGQLTLR.mzML

test.json:
{
"database": {
"bucket_size": 8192,
"fragment_min_mz": 135.0,
"fragment_max_mz": 1350.0,
"peptide_min_len": 6,
"peptide_max_len": 65,
"missed_cleavages": 2,
"static_mods": {
"C": 57.0215
},
"decoy_prefix": "rev_",
"fasta": "my.fasta"
},
"precursor_tol": {
"ppm": [-10,10]
},
"fragment_tol": {
"ppm": [-20.0, 20.0]
},
"isotope_errors": [-1, 3],
"deisotope": true,
"chimera": true,
"output_directory": "output",
"max_fragment_charge": 2,
"process_files_parallel": true,
"mzml_paths": ["LQSRPAAPPAPGPGQLTLR.mzML"]
}

I keep running into the following error

Finished release [optimized] target(s) in 0.12s
 Running `target\release\sage.exe test.json`
[2022-09-06T20:50:54Z INFO  sage] generated 96410556 fragments in 12554ms
[2022-09-06T20:50:54Z WARN  sage] linear model fitting failed, falling back to poisson-based FDR calculation
Encountered error while processing LQSRPAAPPAPGPGQLTLR.mzML: The system cannot find the path specified. (os error 3)

Where should I keep the .mzML? I have also tried putting the full path ("C:/Users/user/sage/LQSRPAAPPAPGPGQLTLR.mzML") in the .json, but I am still getting the same error.

Thanks
HT

Potential bug/regression

There is around a 20% decrease in PSM count at the 0.05 FDR level (and around a 10% decrease at 0.01 FDR) after commit da79dee for TMT analysis of PXD016766.

It is unclear to me how this could be possible, given that the commit has a pretty small surface area - need to investigate.

Support for tims tof data

Hi Michael,

is there a plan to support tims tof data (.d format) in general and ion mobility?

Best,

Tom

Output more than a single peptide for a scan

Is it possible to output the top 'n' peptides per scan, eg n of 10 or even 100? (I couldn't find the option in the configuration file to make this possible, so I'm assuming it's not possible at this point?)

Picked approaches for FDR are currently bugged

While messing around with FDR for LFQ/peak integration, I discovered that there is currently a bug in the implementation of picked-peptide. We need to recalculate FDR/q-values based on the maximum score of the picked-pair, not the hit score. Should also re-visit how much benefit there is to picked-approaches

Note that this only affects peptide FDR, not spectrum or protein.


Same dataset different results

Hi Michael,

Thank you very much for this great tool!

I ran Sage twice using the same parameters on the same dataset and it gave me different results each time.
Would you be able to explain why this occurs?

Thank you,
Tiago

Multiple hits of rank 1

I might be missing something here, but I was surprised to see two hits of rank 1 for the same scan:

scannr peptide proteins rank expmass calcmass charge hyperscore
controllerType=0 controllerNumber=1 scan=28494 [+304.207]-ELDALDANDELTPLGR sp|Q08211|DHX9_HUMAN 1 2045.078 2045.060 3 42.98145
controllerType=0 controllerNumber=1 scan=28494 [+304.207]-AIVAIENPADVSVISSR tr|A0A8I5KQE6|A0A8I5KQE6_HUMAN;sp|P08865|RSSA_HUMAN 1 2045.063 2044.149 3 51.81654

Any idea? Thanks in advance.

Feature request -- add sage version to returned config

I really like how, as part of the output, sage also returns the config file - that's a great feature for reproducibility and tracking what was really run.

It would be even more useful if the version that was used were added as well. This would allow one to exactly repeat a previous run.

Thank you for considering!

Request for c/z ions detection

Dear Mike,

Absolutely FANTASTIC resource, and thanks for keeping it open!

Quick questions:

  1. Where would I define the parameter for the inclusion of C/Z fragment ions? Say, I did EtHCD experiment and wish to search for the presence of C/Z and B/Y ions.
  2. Would you be able to provide a template of the config/json file where we can find out the various param names available to us?

Many thanks,
James

Quantitation

Could you clarify whether sage does TMT MS2 quantitation (none in our result files)? Re-reading the README file, it looks like it's only MS3.

Thank you in advance.

Typos in README sample parameters

The sample JSON parameters at the end of README.md seem to contain a couple of typos:

  1. max_peaks is repeated:
  "max_peaks": 15,          // Optional[int] {default=15}: only process MS2 spectra with at least N peaks
  "max_peaks": 150,         // Optional[int] {default=150}: take the top N most intense MS2 peaks to search,

I believe the first one should read "min_peaks".

  2. There is a comma missing after the "-10" in fragment_tol --> ppm, and an extra comma after the "3" under isotope_errors, both of which generate JSON parsing errors:
  "fragment_tol": {         // Tolerance can be either "ppm" or "da"
    "ppm": [
     -10                    // This value is subtracted from the experimental fragment to match theoretical fragments
     10                     // This value is added to the experimental fragment to match theoretical fragments
    ]
  },
  "isotope_errors": [       // Optional[Tuple[int, int]] {default=[0,0]}: C13 isotopic envelope to consider for precursor
    -1,                     // Consider -1 C13 isotope
    3,                      // Consider up to +3 C13 isotope (-1/0/1/2/3)
  ],
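With both fixes applied, the corresponding fragment of the sample parameters would read (comments kept in the README's annotated-JSON style):

```
"min_peaks": 15,          // Optional[int] {default=15}: only process MS2 spectra with at least N peaks
"max_peaks": 150,         // Optional[int] {default=150}: take the top N most intense MS2 peaks to search
"fragment_tol": {         // Tolerance can be either "ppm" or "da"
  "ppm": [
    -10,                  // This value is subtracted from the experimental fragment to match theoretical fragments
    10                    // This value is added to the experimental fragment to match theoretical fragments
  ]
},
"isotope_errors": [       // Optional[Tuple[int, int]] {default=[0,0]}: C13 isotopic envelope to consider for precursor
  -1,                     // Consider -1 C13 isotope
  3                       // Consider up to +3 C13 isotope (-1/0/1/2/3)
]
```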

LFQ output is empty

Hey Michael,

I tried running a bigger dataset with LFQ and wondered why the lfq.tsv table is empty.
The column headers get generated but all rows are empty.
Do I need to specify something else in the config.json besides setting lfq = true?

Best,
Thomas

Error: unknown variant `tmt`

Running the latest version from Github (version 0.10.0 based on the ChangeLog), I get the following error when performing TMT quantitation:

Error: Error("unknown variant `tmt`, expected one of `Tmt6`, `Tmt10`, `Tmt11`, `Tmt16`, `Tmt18`, `User`", line: 19, column: 13)

The exact same config file using sage-0.8.1 works. Here's the output:


$ ~/bin/sage/sage-0.8.1/target/release/sage ../extdata/tmt2.json
[2023-04-04T14:12:30Z INFO  sage] generated 17153170 fragments in 3800ms
[2023-04-04T14:12:30Z INFO  sage] processing files 0 .. 4 
[2023-04-04T14:12:44Z INFO  sage]  - file IO:     4367 ms
[2023-04-04T14:12:44Z INFO  sage]  - search:      9699 ms (157586 spectra)
[2023-04-04T14:12:44Z INFO  sage] processing files 4 .. 8 
[2023-04-04T14:12:58Z INFO  sage]  - file IO:     4379 ms
[2023-04-04T14:12:58Z INFO  sage]  - search:      9562 ms (158463 spectra)
[2023-04-04T14:12:58Z INFO  sage] processing files 8 .. 12 
[2023-04-04T14:13:12Z INFO  sage]  - file IO:     4272 ms
[2023-04-04T14:13:12Z INFO  sage]  - search:      8993 ms (154619 spectra)
[2023-04-04T14:13:15Z INFO  sage] discovered 95065 peptide-spectrum matches at 1% FDR
[2023-04-04T14:13:15Z INFO  sage] discovered 75903 peptides at 1% FDR
[2023-04-04T14:13:15Z INFO  sage] discovered 10390 proteins at 1% FDR
[2023-04-04T14:13:17Z INFO  sage] finished in 50s
{
  "database": {
    "bucket_size": 16384,
    "enzyme": {
      "missed_cleavages": 0,
      "min_len": 5,
      "max_len": 50,
      "cleave_at": "KR",
      "restrict": "P"
    },
    "fragment_min_mz": 150.0,
    "fragment_max_mz": 1500.0,
    "peptide_min_mass": 500.0,
    "peptide_max_mass": 5000.0,
    "min_ion_index": 2,
    "static_mods": {
      "C": 57.0215,
      "^": 229.1629,
      "K": 229.1629
    },
    "variable_mods": {},
    "max_variable_mods": 2,
    "decoy_tag": "rev_",
    "generate_decoys": true,
    "fasta": "/mnt/isilon/CBIO/data/SCPCBIO/fasta/UP000005640_9606.fasta"
  },
  "quant": {
    "tmt": "Tmt11",
    "tmt_level": 2,
    "lfq": null
  },
  "precursor_tol": {
    "ppm": [
      -20.0,
      20.0
    ]
  },
  "fragment_tol": {
    "ppm": [
      -10.0,
      10.0
    ]
  },
  "isotope_errors": [
    -1,
    3
  ],
  "deisotope": true,
  "chimera": true,
  "min_peaks": 15,
  "max_peaks": 150,
  "max_fragment_charge": 1,
  "report_psms": 1,
  "predict_rt": true,
  "parallel": true,
  "mzml_paths": [
    "/home/lgatto/.cache/R/rpx/3ff35e23e0b2c0_3ff35e6e68a154_dq_00082_11cell_90min_hrMS2_A1.mzML",
    "/home/lgatto/.cache/R/rpx/3ff35e23e0b2c0_3ff35e32eb78de_dq_00083_11cell_90min_hrMS2_A3.mzML",
    "/home/lgatto/.cache/R/rpx/3ff35e23e0b2c0_3ff35ee5d5a37_dq_00084_11cell_90min_hrMS2_A5.mzML",
    "/home/lgatto/.cache/R/rpx/3ff35e23e0b2c0_3ff35e3fb19375_dq_00085_11cell_90min_hrMS2_A7.mzML",
    "/home/lgatto/.cache/R/rpx/3ff35e23e0b2c0_3ff35e790c7b57_dq_00086_11cell_90min_hrMS2_A9.mzML",
    "/home/lgatto/.cache/R/rpx/3ff35e23e0b2c0_3ff35e72d009e6_dq_00087_11cell_90min_hrMS2_A11.mzML",
    "/home/lgatto/.cache/R/rpx/3ff35e23e0b2c0_3ff35e50a7dd0c_dq_00088_11cell_90min_hrMS2_B1.mzML",
    "/home/lgatto/.cache/R/rpx/3ff35e23e0b2c0_3ff35e21356dc_dq_00089_11cell_90min_hrMS2_B3.mzML",
    "/home/lgatto/.cache/R/rpx/3ff35e23e0b2c0_3ff35e7c19e7c6_dq_00090_11cell_90min_hrMS2_B5.mzML",
    "/home/lgatto/.cache/R/rpx/3ff35e23e0b2c0_3ff35e78048db8_dq_00091_11cell_90min_hrMS2_B7.mzML",
    "/home/lgatto/.cache/R/rpx/3ff35e23e0b2c0_3ff35e484bf56b_dq_00092_11cell_90min_hrMS2_B9.mzML",
    "/home/lgatto/.cache/R/rpx/3ff35e23e0b2c0_3ff35e127a1e80_dq_00093_11cell_90min_hrMS2_B11.mzML"
  ],
  "output_paths": [
    "output2/results.sage.tsv",
    "output2/quant.tsv",
    "output2/results.json"
  ]
}

Validation

How do you check the hits? A visual display of the spectra and ion assignments is very useful. Did you use something in development? I see a high number of chimeric spectra returned. I would like to look at these spectra assignments to think about the appropriate scoring thresholds.

Decoy hits have label -1?

Not sure where to find the documentation about the sage.result.tsv file, but could you confirm whether the label variable corresponds to decoy (-1) and forward (1) hits?

Thank you!

Subtle difference in # of matched peaks between prelim and full score

Can be observed in tests/LQSRPAAPPAPGPGQLTLR.mzML file.

Several cases (lowest 2 ranked PSMs) where 0 matched peaks are reported for the PSM (despite the prelim score filtering for > 0 peaks, and 1 "longest b ion" reported), causing average_ppm to be NaN.

This could be due to a subtle difference in the matching algorithm for specific fragment generation. I extensively tested and confirmed an exact match between the old full scoring and the preliminary/full scoring split for the top match for some 40k spectra, so this is likely only to rear its head in edge cases.

{
  "database": {
    "fragment_min_mz": 135.0,
    "fragment_max_mz": 1350.0,
    "peptide_min_len": 6,
    "peptide_max_len": 65,
    "missed_cleavages": 2,
    "static_mods": {
      "C": 57.0215
    },
    "variable_mods": {
      "M": 15.9949
    },
    "decoy_prefix": "rev_",
    "fasta": "2022-08-16-decoys-reviewed-contam-UP000005640.fas"
  },
  "precursor_tol": {
    "ppm": [-20, 20]
  },
  "fragment_tol": {
    "ppm": [-10.0, 10.0]
  },
  "isotope_errors": [-1, 3],
  "deisotope": true,
  "max_fragment_charge": 2,
  "report_psms": 500,
  "mzml_paths": [
    "tests/LQSRPAAPPAPGPGQLTLR.mzML"
  ]
}

Version cloned is different from most recent release (v0.8.0)

Hi there,

Very cool project and excited to test it out.

I simply used git clone to clone and build the repo on Windows 11. When I check the version with --version it says v0.7.1.
I was wondering if I was doing something wrong or should be downloading and building from the 0.8.0 zips?

Thanks!

Support for feature extraction

Hello again!

Another thing I've frequently struggled to find in the existing mass-spec toolset is support for standalone feature extraction. In my head, this would simply pull out all of the processing that Sage does before the actual ion searching and dump that processed information as a peak list with data somewhat similar to MaxQuant's allPeptides.txt.

While it seems like MaxQuant has a million and ten tricks up its sleeve, that's been part of the problem when I'm just looking for a simple list of:

  • Scan number
  • Masses (I'm not sure whether Sage calculates a mean and standard deviation of these? From the centroiding, but also throughout retention times?)
  • Charges (predicted from the deisotoping)
  • Abundances (Ion counts + Intensities, I think this is the job of LFQ?)
  • Retention Times (+ XIC start and end)
  • I suppose eventually I'd want to investigate something more for MS/MS (parent scan number, etc)

I'll admit I'm still learning a lot of this myself, and I found most of the information about "feature finding" from this video: https://www.youtube.com/watch?v=H_vClGghnNo

Even if some of the "fancier stuff" (DIA, etc) preprocessing is a bit out of scope for Sage, having a quick, embeddable (as a library) tool for converting mzML to deisotoped peak-lists would be outstanding!

Let me know what you think! Happy to help with this eventually too :)

Why is deisotoping optional?

Dear Michael,

I was going through the example config file and saw that deisotoping is set to false. I thought deisotoping was essential to accurately quantify peptides. Is there a reason to switch it off?

Thanks
Maithy

Support for SPS-MS3 quantification

Hi, I recently started working on putting together a TMT pipeline and am evaluating Sage as well as Comet/OpenMS/custom approach. My question has to do with processing SPS-MS3 data. I can see you've got this in the works but wanted to open a discussion after encountering an error when attempting to process a file using IsobaricAnalyzer.

Would sage also have issues with quantifying data from these mzMLs which lack isolation window offsets for SPS masses? Here's an example:

<isolationWindow>
  <cvParam cvRef="MS" accession="MS:1000827" value="577.298034667969" name="isolation window target m/z" unitAccession="MS:1000040" unitName="m/z" unitCvRef="MS" />
  <cvParam cvRef="MS" accession="MS:1000828" value="-0.5" name="isolation window lower offset" unitAccession="MS:1000040" unitName="m/z" unitCvRef="MS" />
  <cvParam cvRef="MS" accession="MS:1000829" value="-0.5" name="isolation window upper offset" unitAccession="MS:1000040" unitName="m/z" unitCvRef="MS" />
</isolationWindow>

Thank you for your continued work on sage!

Specifying Temporary Directory

Is there an option to specify a temporary directory for sage? I don't see it in the --help. It also does not appear to be integrated with SearchCLI? Not sure if I am running these commands correctly, but I cannot process a very large mzML file without the use of a temporary directory.

Below is what I have tried using the conda installation of SearchCLI:
searchgui eu.isas.searchgui.cmd.IdentificationParametersCLI -id_params myparameters.par -db myproteins.fasta -spectrum_files "/mnt/data/large.mzML" -sage_folder ./sage -sage 1 -search_engine_temp /mnt/data/temp/

I have also tried with the sage binary, but -temp_folder does not appear to be a command and I don't see anything listed for it in the example json.
./sage myparameters.json -temp_folder /mnt/data/temp/

Support for cross-linking searching?

Hi Michael,

Thanks for this excellent project!

I have a specific problem and would like to know whether I can achieve it by modifying sage here.

I have a protein where many residues are randomly labeled by a known peptide. The linker is known. I have the LC-MS/MS data of this molecule (trypsin digest). Now I want to find which residues of the protein are labeled by the peptide, i.e. to find the label positions.

So in general, it's more like a cross-linking search task, but currently I don't have an idea of how to do it in Sage.

Since Sage is the only project I can find which is totally open-source and friendly for coders, do you have an idea of when cross-linking searching will be supported, or can you give me some hints if I modify Sage for my project myself?

Thanks in advance!

Benchmarking datasets

On the sage web page, you mention several experiments that were used for various benchmarks:

  • PXD016766 for TMT search performance
  • PXD001468 to benchmark open search performance and chimeric search
  • PXD020815 for TMT quantification

I think it would be really helpful to provide the parameters (and possibly even the outputs) of these benchmarks.

The reason I am asking is that I wanted to create a test dataset for some of my Sage-based testing and development, and I wanted to re-use some of the above. Instead of finding the right search parameters and re-running things, at the risk of introducing mistakes or diverging from your results, it would be beneficial to be able to reproduce exactly what you did, to harmonise outputs and get more consistent results.

Handle missing precursor charge information

It is possible to have precursors that are not annotated with charge state - Sage will currently panic/abort if these are found. A reasonable solution is to assume a charge state of 2/3/4, search all three charge states, and merge the results
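The proposed fallback can be sketched as: for an unannotated precursor, derive the neutral mass implied by each assumed charge state, then search each and merge the results. A minimal std-only sketch; the function name and structure are mine, not Sage's API:

```rust
// Mass of a proton in Daltons (standard physical constant).
const PROTON: f64 = 1.007276466;

/// For a precursor with no annotated charge, produce the neutral
/// masses implied by assumed charge states 2, 3 and 4 (the fallback
/// proposed above). Each (charge, mass) pair would be searched
/// independently and the results merged.
fn candidate_masses(precursor_mz: f64) -> Vec<(u8, f64)> {
    (2u8..=4)
        .map(|z| (z, (precursor_mz - PROTON) * z as f64))
        .collect()
}

fn main() {
    for (z, mass) in candidate_masses(500.0) {
        println!("z={z} -> neutral mass {mass:.3}");
    }
}
```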

TMT11 quantification on MS2 level?

Hi, thank you for such a magical search engine: it is fast without any sacrifice in identification, and it does quantification as well.
My test result on label-free data is amazing.
My recent test of TMT11 at the MS2 level (HCD) on Thermo Exploris 480 data doesn't seem to work properly. The identification seems good, just with all channels at 0 for quantification.
I'm wondering, have you ever tested data like this?

Convert mgf to mzml

I have mgf files only (no mzML). I'm thinking of using msconvert (on Linux) to convert the mgf to mzML... but I know that some programs don't always work 100% properly when that occurs (due to some expectation of what should be in an mzML file, etc.). Is there any known/suspected issue that might arise if I take a regular/simple mgf file and try to convert it to mzML? (The reason for this need: the mgf is a pseudo-generated file, generated from DIA data in a manner similar to how DIA-Umpire works.)

Test does not run due to missing .fas file

When attempting to run the post-installation test:

cargo run --release tests/config.json

it fails with a No such file or directory error since the referenced .fas file 2022-07-23-decoys-reviewed-UP000005640.fas is not included in the distribution.

Move toward proforma compliance in open mod searches

Hello there!

I was wondering if you have considered moving to a more standard way of reporting the peptide sequences from the search engine. I noticed that there has already been a shift (from parenthesis to brackets) when peptideshaker-support was added. I think this would be a great addition to allow the usage of the data in downstream applications!

In particular I am facing an issue where open-search mods get reported on the last amino acid, which makes it ambiguous whether the mod is indeed at the terminal position or unknown in location.

# Current way of reporting an open mod with a variable mod
AWEIRDPEPTIDEM[+15.9949][+xx.xxx]

# Suggested Proforma compliant way of reporting it, making explicit the location
# is unknown
[+xx.xxxx]?AWEIRDPEPTIDEM[+15.9949]

LMK what you think!
Thanks again for the amazing search engine
-Sebastian

Leaving here the spec document for later.
https://github.com/HUPO-PSI/ProForma/blob/master/SpecDocument/ProForma_v2_draft15_February2022.pdf

How to install?

Hi Michael,

Many thanks for the fantastic search engine, and congrats!
Do you think you could share some details on how to install and use it? Is it for Linux or Windows? Just unfamiliar with Rust, not sure how to start using Sage :)

Best,
Vadim
