
Human-plasma_DIA-vs-TMT

An apples-to-aardvarks comparison of human plasma proteomes from DIA versus TMT

Introduction

I was working with some DIA data to look for proteins with differential abundances. I had not worked with any DIA dataset previously. The samples were non-depleted human plasma, and I also have no prior experience with non-depleted plasma data. There is a reason that so much effort went into abundant protein depletion methods for human plasma back in the early days, after we had survived the impending doom of Y2K. The issues with looking for biomarkers in human plasma are as old as proteomics itself, with a highly recommended paper from 2002 for the curious. Granted, analytical platforms (LC-MS) have improved some since the turn of the century, but the dynamic range challenge of plasma is more fundamental than just instrument capabilities.

The DIA data was 25 samples run in a single-shot experimental design (some details are mentioned below). The most similar human plasma data that I had analyzed was a 56-sample TMT experiment from 2020 with depleted human plasma (Thermo High-Select Top14 Abundant protein depletion mini spin columns). The TMT experiment used 35 high-pH reverse-phase (RP) fractions, followed by 2-hour low-pH RP separations. About all these two datasets have in common is that the samples are human plasma. Major differences are: DIA versus DDA, label-free versus TMT labeling, non-depleted versus depleted, single shot versus extensive fractionation, methods of data analysis, etc.

Why would any sane person try to compare these two datasets? One, I am probably not all that sane. Two, I actually believe in science, reality, and truth. Both experiments are characterizing the same kind of sample (human plasma). I think that different experimental methods to characterize similar things should provide similar "pictures". The differences should be more like camera filters or image processing filters, where the pictures would differ in appearance but you would still know what the subject was. I do not think that two proteomic methods to characterize human plasma should be something like the blind men and an elephant.

What proteome abstractions need to be created from each dataset to compare? We need to compare forests (maybe large trees in bulk), but certainly not the weeds beneath the underbrush. Despite proteomics being all about large sets of proteins and their properties, we almost always think about and compare things one protein at a time. It is like our methods have evolved way beyond the traditional biochemistry isolate and purify one-protein-at-a-time approaches, but our brains have not. Time for the evergreen joke that we don't need multi-omics methods, we need single-omics methods.

Let's begin this crazy journey down the very rocky and overgrown trail and see if we can get somewhere interesting before we run out of steam and decide to turn back (or be evacuated by helicopter). I have to change names to protect the innocent (Dragnet TV show reference). Both projects are still ongoing, so I am not at liberty to disclose any particulars. This is not a biology exercise. It is more an informatics thought experiment (a "Gedankenexperiment").

Experiment overviews

DIA analysis of non-depleted human plasma

There were 5 subjects with blood draws at 5 time points related to exercise. The subjects are not really a single group (there are some genetic differences), but repeated measures can be done with non-invasive sample collections. This is not what I would consider a typical proteomics experimental design. Plasma was separated from the blood. There was no depletion of any highly abundant proteins like serum albumin. The samples were digested in S-TRAP 96-well plates (reduction/alkylation, digestion with trypsin), run in single-shot EVOSEP LC separations (not sure of gradient length) on a Thermo Exploris Orbitrap (not sure of model). I do not have any details on the DIA settings. The facility that produced the data is experienced and respected, so I suspect this was a good DIA setup.

The data analysis used a twin human plasma library in an openSWATH/pyprophet/TRIC pipeline with protein summarization from mapDIA. The reported proteins had human Swiss-Prot identifiers. The data could have been processed in alternative ways. What I started with was the table of identified proteins (rows) for the 25 samples (columns) where the quantitative values (cells) were protein total intensities (sums of peptide assays, which are themselves sums of transition intensities). I think the general idea is that the libraries have protein assays (a set of filtered peptides [usable for quant], where each peptide has a set of MS2 transitions and a retention time on a normalized scale). It is not clear to what extent the peptides for each protein capture the total abundance of that protein. They are likely selected for good assay reproducibility.

I am not 100% sure, but I think questions of protein inference and which peptides can be used for quantification are addressed during construction of the assay library. OpenSWATH seems to be more of a query of the DIA data for quantitative information for a specific set of assays in a library rather than extraction of all quantitative information from the data followed by some protein-level summarization process.

TMT analysis of depleted human plasma

Blood draws from 4 groups of 7 female subjects per group were done where the groups were related to pregnancy outcomes. Blood was collected at two time points during pregnancy for a total of 56 samples. The samples were depleted of the top 14 proteins using Thermo spin columns. Typical reduction/alkylation, trypsin digestion, and TMT labeling was done. An IRS experimental design was used (two pooled reference channels per plex) with 16-plex TMT reagents to accommodate the 56 samples in 4 plexes. Each plex was processed with 35 fractions using a Thermo Fusion Tribrid instrument and SPS-MS3 reporter ion intensities.

The PAW pipeline was used for peptide ID with the Comet search engine. A canonical human reference proteome (one protein per gene) was used (20.6K sequences) with wide precursor tolerance, semi-tryptic cleavage, variable oxidized Met, and fixed TMT labels at peptide N-term and lysines. Target/decoy methods were used to control PSM FDR. Accurate deltamasses (differences between measured and computed peptide masses) are used to create conditional meta-score distributions for peptides with 0-Da mass errors (most peptide matches), 1-Da mass errors (deamidation of Asn and C13 mis-selected precursors), and any other mass errors. Conditional score distribution histograms of target matches and of decoy matches are used to set meta-score cutoffs and control PSM FDR at typically 1%.

Most publications are especially bad at distinguishing PSMs (MS2 scans in DDA) from peptides when talking about target/decoy FDR. The statistical estimate of the number of random, incorrect target PSMs using the count of decoy PSMs applies to PSMs not peptides. Precise language is important in science.

Protein inference does basic parsimony peptide set analyses (indistinguishable sets combined into groups, formal peptide subsets removed). An additional extended parsimony analysis is done where proteins that have mostly indistinguishable peptide evidence and insufficient distinguishable peptide evidence are combined into protein families. Grouping into protein families removes most variability associated with FASTA sequence collection choices. The two peptide per protein rule is applied per "sample". Proteins make the final list if they have 2 peptides or more in any sample in an experiment. Data for that protein is reported in all samples even if the two peptide rule was not met for that sample.

Note that TMT experiments have the actual biological samples hidden from the peptide identification and protein inference processes. A TMT plex (a set of samples labeled by a TMT reagent kit) behaves like a biological "sample" for protein inference.

After the final list of identified proteins or protein groups or protein families has been determined, then an analysis of which peptides are unique or shared with respect to the final protein list can be done. Peptides that are unique in this final protein context can be used for quantification. Total protein relative abundances are estimated as the sum of all unique PSM reporter ion intensities.

Quantification in bottom-up proteomics experiments is intimately coupled to protein inference if protein-level summaries are generated. Understanding the proper protein context for determining which peptides can be used for quantification is critical. Quantification is not a process that starts with the lowest level data (individual instrument scans) and is propagated up to protein-level values. It is more of a top-down process where one starts with proteins, determines which peptides can be used for quantification, and then the scan-level data associated with the usable peptides can be aggregated.
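The top-down direction can be sketched with a toy example (all protein and peptide names and all intensities below are made up, not from either dataset):

```python
# Hypothetical final protein context: which proteins each peptide maps to,
# and the aggregated scan-level intensity for each peptide.
peptide_to_proteins = {
    "PEPTIDEA": {"P1"},        # unique to P1 -> usable for quant
    "PEPTIDEB": {"P1", "P2"},  # shared -> excluded from quant
    "PEPTIDEC": {"P2"},        # unique to P2 -> usable for quant
}
peptide_intensity = {"PEPTIDEA": 1000.0, "PEPTIDEB": 800.0, "PEPTIDEC": 250.0}

def protein_totals(pep_map, intensities):
    """Start from the final protein list, keep only peptides that are
    unique in that context, then sum their intensities per protein."""
    totals = {}
    for pep, prots in pep_map.items():
        if len(prots) == 1:  # unique in the final protein context
            prot = next(iter(prots))
            totals[prot] = totals.get(prot, 0.0) + intensities[pep]
    return totals

print(protein_totals(peptide_to_proteins, peptide_intensity))
# {'P1': 1000.0, 'P2': 250.0}
```

Note that the shared peptide is dropped entirely rather than split between proteins, which is the conservative choice described above.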

Errors in quantitative bottom-up proteomics most frequently involve poor FASTA choice (excessive peptide redundancy), insufficient protein grouping given today's large data sets (there will be incorrect peptides mapped to correct proteins), not understanding what protein context to use for determining which peptides can be used for quantification (this can result in many usable peptides being excluded from quantification), and not using actual quantitative measurements to compute relative abundances (signal-to-noise ratios are not measured values).

Comparison metrics

Number of identified proteins

About the only omics-like characterization of proteomics experiments that we seem to consistently report in publications and data analysis summaries is the number of identified proteins. And that is usually not done well. What is typically reported is the union of all protein IDs over all samples in an experiment. This is an error-inflated number that is rather useless. We can do better.

DIA experiment

The starting protein-level data summary had 1014 protein rows. One entry was the iRT standards and there were 18 other proteins that were hemoglobins (3) or keratins (15). That left 995 protein IDs from the union of IDs across the 25 samples. The number of proteins with reported quantitative data per sample was around 500 (average of 507 proteins with a standard deviation of 27 proteins). The average number of identified proteins per "sample" is a far better metric for proteome depth for a given proteomics method. There are still issues related to inflated protein IDs (large FASTA files, single peptide per protein IDs), but this avoids the accumulation of low confidence, likely non-quantifiable proteins in the union of IDs.

There are always more identifiable proteins than quantifiable proteins in any experiment. The average number of IDs per sample will stabilize and be independent of the number of samples. The union of all IDs keeps growing with the number of samples. This creates an expectation gap where the difference between the number of IDs and the number of quantifiable proteins seems worse in experiments with more samples. We need to look at some other data characteristics explored below to determine what proteins are quantifiable in these survey experiments.
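The two metrics are easy to contrast with a toy sketch (the per-sample ID lists below are hypothetical):

```python
# Hypothetical per-sample protein ID sets for a 3-sample experiment.
ids_per_sample = [
    {"ALB", "APOA1", "TF", "HP"},
    {"ALB", "APOA1", "TF", "C3"},
    {"ALB", "APOA1", "HP", "IGHG1"},
]

# Union of IDs: keeps growing as more samples are added.
union_ids = set().union(*ids_per_sample)

# Average IDs per sample: stabilizes, independent of sample count.
avg_ids = sum(len(s) for s in ids_per_sample) / len(ids_per_sample)

print(len(union_ids))  # 6
print(avg_ids)         # 4.0
```

Adding more samples to this toy experiment would keep inflating the union while the per-sample average stayed near 4.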

TMT experiment

The 56-sample TMT experiment was performed in 4 TMT plexes (one set of TMTpro 16-plex reagents). There were about 600K MS2/MS3 spectra acquired per plex. The number of matched scans at a 1% FDR was about 95K per plex. The numbers of protein IDs per plex (excluding decoy and contaminant matches) were 916, 961, 951, and 996, respectively. The average was 957 with a standard deviation of 33. There was an average of 100 PSMs mapped to each protein; however, PSMs per protein have a very skewed distribution. Depletion of major blood proteins and the extensive fractionation only increased the protein identification depth by a factor of 2. The union of all IDs was about 1,250. The union of IDs is closer to the average number of IDs here because we only have 4 effective "samples" (the plexes) and the extensive fractionation increases the odds of seeing proteins consistently between "samples".

Number of quantifiable proteins and missing data

There are definitions of limits of detection and quantification in true mass spec assays. Those do not apply (very well) in these survey experiments. What drives the number of proteins we can attempt to quantify in practice is missing data. You cannot quantify what you cannot measure. You cannot compute imputation values out of thin air. We have to understand the nature of missing data to figure out what subset of the protein IDs have sufficient measurements for abundance estimates.

DIA experiment

As I said, I do not work with DIA data much. Everything I have read mentions how DIA does not depend on the random precursor selection of DDA, so quantification is much more reproducible between samples. My brain translates those statements into an expectation that there will be less missing data in DIA datasets compared to DDA label-free methods.

There were just 157 proteins that had intensity values in all 25 samples. Of the cells in the 995 row by 25 column table, 49% were missing. The missing data dependence on protein abundance rank (average of protein intensities over the 25 samples) was a nearly linear function. The top half of the abundance ranked proteins had 27% of the total number of missing values and the bottom half had 73%. The pattern I have seen in DDA datasets for spectral counting and MS1 label-free quant typically has very little missing data for highly abundant proteins and a more pronounced concentration of missing data in the lowest abundance ranked proteins. I do not know if this unusual pattern for the DIA data is due to DIA or to non-depleted plasma. The top 500 proteins by average protein intensity abundance ranking were used as candidates for imputation of missing values.
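The abundance-rank dependence of missing data is simple to tabulate. A minimal sketch, using a hypothetical toy table where `None` marks a missing cell:

```python
# Toy protein-by-sample intensity table; None marks a missing measurement.
table = {
    "P_high1": [9e6, 8e6, None, 9e6],
    "P_high2": [5e6, 5e6, 6e6, 5e6],
    "P_low1":  [2e3, None, None, 1e3],
    "P_low2":  [None, 1e3, None, None],
}

def mean_observed(vals):
    obs = [v for v in vals if v is not None]
    return sum(obs) / len(obs)

# Rank proteins by average observed intensity, highest first.
ranked = sorted(table, key=lambda p: mean_observed(table[p]), reverse=True)
half = len(ranked) // 2

def count_missing(prots):
    return sum(v is None for p in prots for v in table[p])

top_missing = count_missing(ranked[:half])     # missing cells in the top half
bottom_missing = count_missing(ranked[half:])  # missing cells in the bottom half
print(top_missing, bottom_missing)  # 1 5
```

In real data, comparing these two counts (or finer abundance-rank bins) shows whether missingness is concentrated at the bottom of the proteome or spread nearly linearly, as it was here.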

Missing value imputation is complicated. I took a sensible approach. First, divide the samples into the biological groups. For each protein in each subset of samples, if there were a majority of observed values, use the average of the observed values as a missing value replacement. The data clustered by biological subject more tightly than by time point, so subjects were considered the biological groups for this step.

That makes 5 groups of 5 observations. If there were at least three measured values, group average imputation was done. Since this is at most imputing two values, a random jittering of the average was not done (replacing two missing values with the same average value does artificially reduce variance; the imputation could be improved). After imputation, proteins were filtered for full data grids (all 25 values after imputation). The final number of quantifiable proteins was 236 (23 of those were immunoglobulin sequences).
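The group-average rule can be written down in a few lines. This is a sketch of the logic described above, not the exact code used:

```python
def impute_group(values, min_observed=3):
    """Replace missing values (None) with the group average, but only
    if at least `min_observed` of the group's values were measured."""
    obs = [v for v in values if v is not None]
    if len(obs) < min_observed:
        return None  # too sparse; protein gets filtered out later
    avg = sum(obs) / len(obs)
    return [v if v is not None else avg for v in values]

# One protein in one 5-sample biological group (toy numbers):
print(impute_group([10.0, 12.0, None, 11.0, None]))
# [10.0, 12.0, 11.0, 11.0, 11.0]
print(impute_group([10.0, None, None, None, None]))
# None (only 1 observed value, not imputed)
```

After running this per protein per group, only proteins with complete grids across all groups survive the filter.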

Many biological fluids contain immunoglobulins. There are IgG complexes (2 heavy chains and 2 light chains) in circulating blood and in any fluids that have serum infiltrate components. Many other fluids have secretory IgA complexes. The representation of immunoglobulins in UniProt reference proteomes is tricky. There are a few hundred heavy and light entries and not much that helps distinguish Ig complexes from each other. Any list of proteins detected in body fluids will likely have many immunoglobulin protein IDs that should probably be lumped together into some sort of a total Ig abundance. It is really messy.

This approach to imputation of missing values will exclude situations where proteins were present in some biological conditions (groups) and not present in other biological conditions. However, present-or-not-present testing is conditional testing and falls outside the realm of data imputation prior to hypothesis testing. Present-or-not-present conditional testing depends on study design, prior knowledge of biological plausibility, sample processing QC (no unusual behaviors specific to groups, low levels of contaminants that are somewhat consistent across samples/groups, etc.), LC-MS QC (consistent spray, TIC, peak shapes, etc. between samples/groups), and data QC (consistent normalization factors, expected clustering by groups, reasonable sample-to-sample consistency within groups, etc.).

When all of these QC metrics have been checked and met, then what constitutes the practical lower level of quantitative values can be determined. Some quantitative extraction algorithms may return zero or missing values, or they may return integrated noise (typically smaller values that are not zero or missing). The question of present or not present boils down to demonstrating something like, "protein X was clearly present in group A and clearly not present in group B". "Clearly present" requires knowing the noise level and picking some minimum signal far enough above the noise to be convincingly present. Similarly, the level that can be called not present also needs to be convincingly determined. All this is very case dependent and does not generalize into a data imputation algorithm.

TMT experiment

TMT datasets have very different missing data characteristics (and much less missing data). Within each of the 4 TMT plexes, there is very little missing data at the aggregated protein level (remember we average 100 PSMs per protein). If we can ID a protein (with the two peptide rule), we almost always have reporter ion intensities (there are a few IDs without any reporter ions). The samples are distributed across the 4 plexes, though, and we have to combine the data before we can do any statistical testing.

The internal reference scaling (IRS) experimental design uses duplicate pooled standard channels in each plex to scale all protein intensities to a common measurement scale. That requires that there are intensity values for proteins in all reference channels in all of the 4 plexes to be able to do the mathematics. That reduces the number of quantifiable proteins a little. There were 897 quantifiable proteins after IRS. This is not much below the average of 957 protein IDs per plex. There were only 0.7% missing values associated with the 897 proteins. 817 proteins had full sets of reporter ion intensities. Missing values are concentrated in the lower abundance proteins and zero value replacement with small fixed values works well.
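The core of the IRS math for a single protein can be sketched as follows (the reference-channel means below are made-up numbers; this is the general idea, not the exact pipeline code):

```python
from math import exp, log

def irs_factors(ref_means_per_plex):
    """Per-protein IRS: compute one scaling factor per plex so that each
    plex's pooled-reference mean lands on the geometric mean of the
    reference means across all plexes (a common measurement scale)."""
    geo = exp(sum(log(m) for m in ref_means_per_plex) / len(ref_means_per_plex))
    return [geo / m for m in ref_means_per_plex]

# Hypothetical reference-channel means for one protein in 4 plexes.
ref_means = [100.0, 200.0, 400.0, 800.0]
factors = irs_factors(ref_means)

# Multiplying each plex's intensities for this protein by its factor
# puts all four plexes on the same scale for that protein.
scaled = [f * m for f, m in zip(factors, ref_means)]
print(scaled)  # all four values are now (nearly) identical
```

The requirement in the text falls out of the formula: if any plex's reference mean is missing (zero), the geometric mean and that plex's factor cannot be computed, so the protein drops out of the quantifiable set.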

Abundant protein depletion and extensive fractionation could only increase the protein ID number by a factor of 2, but they increased the number of quantifiable proteins by a factor of 4. The quality of the quantitative data was also better with TMT. The DIA experiment has repeated measures at 5 time points from the same 5 subjects. The samples in the TMT experiment were more independent subjects (n=7) in the 8 groups (4 pregnancy outcomes and 2 time points). The average median coefficient of variation (CV) in the TMT experiment was about 22%. The median CV over the time points in the DIA data was 42%. The TMT data is SPS-MS3, so it is less likely to suffer from artificially reduced CVs caused by reporter ion dynamic range compression (what distorts MS2-based reporter ion measurements [a.k.a. ratio compression]).
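For reference, the median CV summary used above is computed per protein within a group and then summarized across proteins. A minimal sketch with made-up replicate intensities:

```python
from statistics import mean, stdev, median

def percent_cv(values):
    """Coefficient of variation (%) for one protein's replicates."""
    return 100.0 * stdev(values) / mean(values)

# Hypothetical within-group replicate intensities for three proteins.
groups = [
    [100.0, 110.0, 90.0],  # CV = 10%
    [50.0, 55.0, 45.0],    # CV = 10%
    [10.0, 14.0, 6.0],     # CV = 40%
]

med_cv = median(percent_cv(g) for g in groups)
print(med_cv)  # 10.0
```

The median (rather than the mean) of the per-protein CVs is the robust summary, since CV distributions have long right tails from low-abundance proteins.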

Does the DIA "proteome" resemble the known plasma proteome?

Before we can say if two proteomes are similar or different, we have to have a working definition of what a proteome is. I think of proteomes as lists of proteins with associated relative abundance measures. We may want to average abundance data from biological samples to get more reliable "average" proteomes. We may be able to average over multiple biological groups depending on how similar the groups seem and what the particular biological question might be. Clearly, some biological domain knowledge is needed to make these decisions and the working definition of proteomes will depend on many factors that may not be particularly generalizable.

The idea of a proteome needs to go beyond some simple comparison of protein identifications (think of the typical Venn diagrams). We need to incorporate the relative abundance dimension. There are many options here. We can use relative abundance estimates to rank the protein lists by decreasing abundance. We can compare protein ranks. That would work better if the lists were similar in length; however, they seldom are.

Bottom-up proteomics is really more a case of top-down peptidomics. Everything starts with the highly abundant peptides from the highly abundant proteins. They define the ceiling. They determine the saturation of LC-MS platforms and limit peptide digest loading amounts. The whole analytical platform has to work through these abundant peptides in its quest to get to the bottom of the proteome. Each proteomic experiment (experimental design, bench steps, LC, mass spec, data acquisition, and data analysis) has different inherent floors (how far into the proteome one gets [where the direction is towards lower abundance peptides/proteins]). It is really only the "tops" of proteomes that can be compared; the "bottoms" of proteomes are ragged and of varying lengths.

One way to use relative protein abundance dimensions in proteomes is to compute fractional relative abundances of a total proteome abundance (parts per million or parts per billion might be good choices to avoid fractions less than 1). Since we always start at the top of the peptide abundance world and work our way down in shotgun proteomics, we will likely have covered a high fraction of the total proteome by the time we get to the "floor" of any particular experiment. The part of the proteome that might be missing will be a small fraction of the total and have little effect on the fractional relative abundances of the observed proteins. We will end up with fairly standardized relative abundance scales for observed proteins that we can compare between experiments, to literature, etc.
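The ppm conversion is a one-liner. A sketch with hypothetical average intensities (the gene symbols and numbers are illustrative only):

```python
# Hypothetical average protein intensities from one experiment.
intensities = {"ALB": 6.0e8, "APOA1": 3.0e7, "TF": 2.5e7, "C3": 1.2e7}

total = sum(intensities.values())
ppm = {p: 1e6 * v / total for p, v in intensities.items()}

# ppm values sum to 1e6 by construction, giving a standardized scale
# that can be compared across experiments, platforms, and literature.
print(sorted(ppm, key=ppm.get, reverse=True))  # abundance-ranked list
```

Because the scale is normalized to the observed total, it is only trustworthy when the experiment has captured most of the proteome's total signal, which the preceding paragraph argues is usually the case.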

There is a "known" human blood/serum/plasma proteome that has been compiled from many decades of literature that has been summarized nicely by Sigma Aldrich. After removal of red blood cells and clotting factors, we have non-depleted human plasma that is what the double pie chart shows in the above link. This is a gene-centric relative abundance summary and we may have to combine some of the proteomics results to get comparable gene-centric values (like the immunoglobulins).

The obvious first check is serum albumin, whose relative abundance should be about 60% of the total non-depleted plasma proteome. The total average intensity per sample of the 995 proteins in the DIA results table was just over 590 million. Serum albumin was the top ranked protein (yeah!) with 56.4M (10.2% of the total signal). The next most abundant protein was apolipoprotein A-1 with an average intensity of 55.8M, and the sum of immunoglobulin entries was 62M. There were about 5 decades of dynamic range in the average total protein intensities.

Clearly, serum albumin abundance is underestimated by the openSWATH assay. Albumin should be about 20 times more abundant than apolipoprotein A-1 and maybe 3-4 times the total immunoglobulin levels. The openSWATH assays may not be designed to give a proper reflection of the true plasma proteome relative abundances. We do have some concordance with the expected known top plasma proteins; they are near the top of the ranked list in the DIA results. I do not know how much of the distortion of the plasma proteome is due to openSWATH data analysis versus DIA in general.

Conclusions

There are so many differences between the DIA experiment and the TMT experiment that any conclusions are weak at best. The two major differences are whether or not major blood proteins were depleted, and single-shot versus fractionated LC-MS. The effects of depletion depend on the depletion technique used, of course. Most kits say what general gene products are depleted, but we do not have any detailed mapping between those genes and the protein entries we can have in protein FASTA files (and there are many sources of FASTA sequence collections). We have some vague idea of what the high abundance protein composition of non-depleted human plasma looks like. We do not have similar information for depleted plasma proteomes. This makes comparing non-depleted plasma (DIA) to depleted plasma (TMT) a challenge.

I do not know how single shot proteomes compare to highly fractionated proteomes. When fractionation was developed some 25 years ago, the number of protein IDs was what was explored. We did not have much in the way of methods for measuring protein relative abundance back then. I do not know if anyone has looked at how similar or different single shot proteomes are compared to fractionated proteomes (I suspect not). Yeast would be a good system for this since it is one of the few proteomes with some ground truth. I think this is an interesting question that is not easy to answer. How does LC gradient length affect the proteome for single shot? What about instrument scan speed and cycle time? Are there different types of LC separations to explore? How does DDA versus DIA affect things? How different are proteomes with different DIA parameters (window schemes, library or library-free, software choice, etc.)? How do 6-8 fractions compare to 30-40 fractions? How similar or different are alternative first dimension fractionation methods (SCX vs high-pH RP, etc.)? What about different label-free methods in DDA (spectral counting, MS1 peak heights/areas, etc.)? It would be nice to have at least a few of these methods mapped out so we know how to match proteomic methods to biological experiments.

My first impression of this DIA data left me unimpressed. Proteome depth was limited but I found the missing data situation more troubling. I expected there to be less missing data in a DIA label free experiment compared to DDA label free experiments. Have you really done a quantitative proteomics experiment (either DDA or DIA) if you have not measured anything half of the time? The level of missing data in most label-free experiments seems kind of embarrassing to me.

I have invested a huge amount of time and effort into TMT data processing to get to quantitative quality that I feel comfortable with. I have a lot of skin in that game. There will be many readers of this with similar skin in the DIA game. It is clear to me that DIA has made tremendous improvements in the last decade. I am far from convinced that many of the depths of quantitative protein profiling being reported are actually possible, though. At the end of the day, we have the same LC system limitations and mass spectrometer limitations whether we used DIA or DDA. Too good to be true usually means exactly that.


Phil Wilmarth
February 7, 2023
