
flowworkspace's Introduction

flowWorkspace: An infrastructure tool for hierarchically gated flow cytometry data.


This package is designed to store, query and visualize hierarchically gated flow data.

It also facilitates the comparison of automated gating methods against manual gating by importing basic flowJo workspaces into R and replicating the gating from flowJo using the flowCore functionality. Gating hierarchies, groups of samples, compensation, and transformation are reproduced so that the output matches the flowJo analysis.

Reporting Bugs or Issues

  • Use the issue template on GitHub when creating a new issue.
  • Follow the instructions in the template (do your background reading).
  • Search and verify that the issue hasn't already been addressed.
  • Check the Bioconductor support site.
  • Make sure your flow packages are up to date.
  • THEN if your issue persists, file a bug report.

Otherwise, we may close your issue without responding.

INSTALLATION

# First, install it from Bioconductor so that all dependent packages are pulled in automatically
install.packages("BiocManager") # if not already installed
BiocManager::install("flowWorkspace") # may be older
# Then, install the latest version from GitHub using the devtools package
install.packages("devtools")
library(devtools) # load it
install_github("RGLab/flowWorkspace")

Import flowJo workspace

library(flowWorkspace)
dataDir <- system.file("extdata", package = "flowWorkspaceData")
wsfile <- list.files(dataDir, pattern = "manual.xml", full.names = TRUE)
ws <- openWorkspace(wsfile)
gs <- parseWorkspace(ws, path = dataDir, name = 4, subset = "CytoTrol_CytoTrol_1.fcs")
gs

# get the first sample
gh <- gs[[1]]
# plot the hierarchy tree
plot(gh)
# show all the cell populations (nodes)
getNodes(gh)
# show the population statistics
getPopStats(gh)
# plot the gates
plotGate(gh)

More examples:

flowworkspace's People

Contributors

bjreisman, dillonhammill, djhammill, dtenenba, gfinak, hpages, jacobpwagner, jwokaty, kayla-morrell, kevinushey, mikejiang, nturaga, ramhiser, vobencha

flowworkspace's Issues

add rbind method to ncdfFlow and flowWorkspace

It would be handy to have a combine method that takes a list of ncdfFlowSets/GatingSets instead of merging two at a time with rbind2. It would also resolve the disk space issue caused by repeatedly copying cdf files during multiple rbind2 calls.

plotting and naming of plotted gates

Write a new plotting routine for plots exported to LabKey that:

  1. Plots multiple gates that share the same dimensions and the same parent population on the same graph.
  2. Names the plot according to the parent population, not the gated populations.
  3. Uses the order of dimensions defined in the gate, not the order in which they appear in the flowFrame.

Error while creating a GatingSet

Using
R: 2012-03-19 r58787

flowWorkspace:
branch: cpp_devel (70)

On Blackrhino navigate to the folder: "/loc/no-backup/mike/ITN029ST"

Fire up an R session and input the following commands:

suppressMessages( library(QUALIFIER) );
ws<-openWorkspace("QA_MFI_RBC_bounary_eventsV3.xml");
GT<-parseWorkspace(ws,execute=FALSE,useInternal=TRUE,subset=c(1));
# enter 2 at the interactive prompt
gh_template<-GT[[1]];
fls <- list.files(pattern='*.fcs');
G<-GatingSet(gh_template,fls[1],path='.');

Now repeat the last command several times.
It takes me four attempts to reach the following error:

generating new GatingSet from the gating template...
copying transformation from gh_template...
Creating flowSet...

start to free GatingSet...

*** caught segfault ***
address 0x2500000024, cause 'memory not mapped'

Possible actions:
1: abort (with core dump, if enabled)
2: normal R exit
3: exit R without saving workspace
4: exit R saving workspace
Selection:

Issue confirmed by gfinak

unarchive API

Hi, Mike
Could you look into changing the unarchive API to allow extraction of gating sets into arbitrary directories, rather than into the tmp directory? I think it would be generally very useful.

Thanks.
Greg

format axis number for plotGate

Originally we divided the transformed scale into 20 equal intervals (axis "at") and converted them back to the raw scale (axis "label") using the inverse transformation functions.
This made the plot axis awkward, because the raw-scale labels computed from 20 logicle/log values were arbitrary numbers.

Now we determine the raw-scale labels first, using the sequence 10^(2:n), and convert them to transformed values. The labels are also formatted as exponential expressions, which is more consistent with flowJo plots. See example plot.

A side effect is that we save the time spent on the inverse transformation, simply because we no longer need it.
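A minimal base R sketch of the label-first approach described above. The transformation `asinh_trans` is a stand-in assumed for illustration (the real code would use the logicle or log transformation stored with the sample), and `axis_ticks` is a hypothetical helper, not part of flowWorkspace.

```r
# Stand-in for the sample's forward transformation (assumption: the real
# code would use the logicle/log function parsed from the workspace).
asinh_trans <- function(x) asinh(x / 150)

# Hypothetical helper: choose the raw-scale labels first, then map them to
# positions on the transformed scale. No inverse transformation is needed.
axis_ticks <- function(raw_max, trans = asinh_trans) {
  stopifnot(raw_max >= 100)
  exponents <- 2:floor(log10(raw_max))
  list(
    at     = trans(10^exponents),                    # tick positions
    labels = parse(text = paste0("10^", exponents))  # plotmath labels
  )
}

ticks <- axis_ticks(262144)
ticks$at
```

The result could be passed to `axis(1, at = ticks$at, labels = ticks$labels)`, which renders the labels as powers of ten.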

implicit redundant FCS file parsing

During flowWorkspace parsing, there are currently two steps that involve FCS IO:

1. Construct the ncdfFlowSet with "read.ncdfFlowSet", which parses each FCS file with "read.FCS" and constructs:
A. flowSet (in-memory data structure)
B. cdf file (since data needs to be written back to cdf after compensation and transformation anyway, we keep the cdf empty at this stage to reduce the cost of disk IO)

2. Before the actual gating (compensation and transformation), "read.FCS" gets called explicitly one more time in order to fetch the event data.

So both steps do the entire FCS parsing through "read.FCS", yet the first one only needs the metadata and the second only the data matrix.

I did try writing data to cdf at the first step and loading data from cdf instead of the FCS file at the second step. Even though this removes the overhead of parsing the FCS header, the extra writing cost offsets the gains, according to testing on 6 samples from Stanford.

So I may try to hack into read.FCS to do only the header parsing at the first step and only the data matrix reading at the second step. Hopefully that will give us some extra speed.

memory efficiency of getData method

Right now it remains the original getData method written for the R parser, which subsets the raw data and returns a flowSet. This can be memory-expensive when calling getData on upstream gates such as singlet or lymph that contain many events.

We may want to overload the original method to return an ncdfFlowSet instead.

1d densityplot support for plotGate

Now that the 1-D gate display issue for rare populations has been fixed in flowViz::densityplot, it would be helpful to add this option to the plotGate methods.
The new argument would be "type=c("xyplot","densityplot")", which allows the user to optionally visualize gates as a 1-D density curve.

change the population indices to increase space efficiency

Right now the indices are stored as global bit vectors for fast indexing of the raw data. This gives good performance for logical operations like unions and intersections compared to integer indices, but costs more space. For instance, an FCS file of 250k events is about 15 MB.
170 gates will end up with indices of N*170/8/10^6 ≈ 5.3 MB, which is about 1/3 of the raw data size.

We have the following solutions based on the discussions so far:
1. Compress the bit vector (run-length encoding is one compression technique we can use).

2. Use global integer indices. If most of the gates are rare populations (like cytokine gates), this could be more efficient even though an integer takes 32 times as much space as a bit.

3. Use local indices instead of global ones (suggested by Raphael ages ago). These can be either bit or integer vectors.

The 1st and 3rd can ease the space problem for sure, at the cost of some extra computation time (either sorting or converting local indices to global).

A modified version of the 2nd could be promising too: we could mix the two types of indices in one gating tree, so that each gate stores its indices as either a bit or an integer vector, whichever is smaller.
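The space estimate and the mixed scheme can be sketched in base R as follows. The helper names (`bit_bytes`, `int_bytes`, `choose_index_type`) are hypothetical, and the sketch assumes 32-bit integer indices and one bit per event for the bit vector.

```r
# Storage cost of one gate's event indices, in bytes.
bit_bytes <- function(n_events) ceiling(n_events / 8)  # one bit per event
int_bytes <- function(n_in_gate) 4 * n_in_gate         # one 32-bit integer per member

n_events <- 250000
n_gates  <- 170

# All gates stored as global bit vectors, in MB.
total_bit_mb <- n_gates * bit_bytes(n_events) / 1e6

# Hypothetical rule for the mixed scheme: keep whichever form is smaller.
choose_index_type <- function(n_in_gate, n_events) {
  if (int_bytes(n_in_gate) < bit_bytes(n_events)) "integer" else "bit"
}

# Run-length encoding collapses long runs of identical bits.
membership <- c(rep(FALSE, 1000), rep(TRUE, 50), rep(FALSE, 1000))
runs <- rle(membership)

total_bit_mb
choose_index_type(1000, n_events)    # rare population
choose_index_type(100000, n_events)  # large upstream gate
length(runs$lengths)
```

Under these assumptions the integer form is smaller only when a gate holds fewer than 1/32 of all events, which is why it suits rare populations and not upstream gates.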

transformation parsing issue

The C parser failed to find the correct calibration tables in the NHLBI workspace because multiple sets of calibration tables exist in the xml.

The R parser seems to do the job. I need to troubleshoot and fix it.

potential risk of segfault due to the empty external pointer in GatingSet object

When splitting data into subjects and running flowAP on each subject in parallel with snow, the GatingSet objects collected from each computing node have a nil external pointer. Apparently this is because R doesn't take care of cloning the C data structure.

I can bypass this issue by using one GatingSet and splitting the jobs at the gate level, but it is still a serious issue to have an empty external pointer in a GatingSet R object, which easily crashes R.
Because the external pointer is not a simple NULL pointer when passed from R to C, there is no way to do the validity check on the C side.
I am not sure how to do the validity check in R either, since the simple comparison ptr=="<pointer: (nil)>" doesn't work.
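One possible R-side check, offered as a sketch and not as the package's method: `identical()` compares external pointers by the address they hold, and `new("externalptr")` holds a nil address. The helper `is_nil_ptr` is hypothetical. Note that `identical()` also compares attributes, so this assumes the pointer being tested carries none.

```r
library(methods)

# Hypothetical helper: TRUE when an external pointer holds a nil address.
is_nil_ptr <- function(ptr) {
  stopifnot(typeof(ptr) == "externalptr")
  identical(ptr, new("externalptr"))
}

# Objects shipped back from worker processes are serialized, and an
# external pointer is always restored with a nil address.
restored <- unserialize(serialize(new("externalptr"), NULL))
is_nil_ptr(restored)
```

A guard like this could run before any call that hands the pointer to C code.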

export gating strategy from flowWorkspace

Besides saving the entire analysis (GatingSet) mentioned in Issue #11, we would also like to provide an API to export just the gating strategy (or gating template) from a sample-specific GatingHierarchy. Gating-ML would be one good format to store such a gating strategy.

I just browsed the Gating-ML specification. My impression is:

  1. It is a flat-structured schema that defines the individual components: gates, transformations, compensations.
  2. The gate component has an optional parent_id attribute, which refers to another gate, so the gating strategy can be represented by these parent/child relationships.
  3. An analysis-oriented hierarchical structure is not part of the specification.

Secondly,
Windows flowJo workspaces (v7.6.4) already use Gating-ML to define nodes like gates, transformations and compensations, but the Mac version (9.4.2) doesn't yet.

Thirdly,
the flowUtils package is documented as the tool to parse individual Gating-ML components into R (flowCore objects). Since there is no gating template class defined in flowCore, there is currently no functionality to reconstruct the hierarchical gating strategy.
It is worth noting that flowUtils does have undocumented and unexposed (yet quite sophisticated) routines to convert a workflow to a flowJo workspace (which presumably complies with Gating-ML). I will have to test them to see how they work before we decide to leverage them.

get Children and parent method

There is an issue with converting between VertexID and node index, which causes incorrect results to be returned, particularly for node 6 "True T-Regs". It needs to be debugged and fixed.

subsetting of gating set

The "ncFlowSet" method with signature(x="GatingSet") did not subset the ncdfFlowSet/flowSet according to the "[" operation on the GatingSet, because it was originally designed (as the now-deprecated getNcdf function) for extracting the back-end cdf data repository.

Since we now expect it to return the actual subsetted data, it needs to subset and return a subview of the data according to the GatingSet (e.g. G[1:n]) instead of the entire dataset.

Hopefully the change won't break any other packages.

Error subsetting gating sets

An arbitrary example:

If G is a gating set of size 20 and I call:

G[1:10]
associate the ncdfFlowSet to tree structure...

without assigning the result anywhere,
then

G[1:20]
Error: Subset out of bounds

The subset operation seems to modify the existing gating set somewhere, and later subset operations on the original gating set are broken.

extend the current Viz API:plotGate

The plotGate method currently has signature(x="GatingHierarchy",y="numeric") or signature(x="GatingHierarchy",y="character") to plot 1 gate vs 1 FCS.

We could add signature(x="GatingHierarchy") without the population index (y), or allow y to be a vector, to plot all or some sequential gates on one page per FCS.

We could also add signature(x="GatingSet",y="numeric") to this method, which produces the lattice plot for 1 gate vs n FCSs,
and signature(x="GatingSet") for n gates vs n FCSs.
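The proposed dispatch on a missing, numeric or character `y` can be sketched with S4 using toy stand-ins. `ToyHierarchy` and `gatesToPlot` are hypothetical names introduced for illustration; the sketch only returns the gate names that would be plotted and does no plotting.

```r
library(methods)

# Toy stand-in for a GatingHierarchy: just a vector of gate names.
setClass("ToyHierarchy", representation(gates = "character"))

# Sketch generic; returns the gate names that would be plotted.
setGeneric("gatesToPlot", function(x, y, ...) standardGeneric("gatesToPlot"))

# No population index supplied: select every gate.
setMethod("gatesToPlot", signature(x = "ToyHierarchy", y = "missing"),
          function(x, y, ...) x@gates)

# Numeric index, possibly a vector: select gates by position.
setMethod("gatesToPlot", signature(x = "ToyHierarchy", y = "numeric"),
          function(x, y, ...) x@gates[y])

# Character index: select gates by name.
setMethod("gatesToPlot", signature(x = "ToyHierarchy", y = "character"),
          function(x, y, ...) x@gates[match(y, x@gates)])

gh_toy <- new("ToyHierarchy", gates = c("singlets", "lymph", "CD3"))
gatesToPlot(gh_toy)
gatesToPlot(gh_toy, 2:3)
gatesToPlot(gh_toy, "lymph")
```

A GatingSet method could follow the same pattern and apply the selection to each sample.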

performance testing

1. Evaluate the speed without the "read.ncdfFlow" part.
2. Test on a bigger dataset (ITN data).

parsing gating template without the actual FCS presence

Parsing 1 samples
Parsing sampleID 74014
Removing 1 samples from the analysis since we can't find their FCS files.

GT
A GatingSet with 0 samples
1 . Error in show(object[[i]]) :
error in evaluating the argument 'object' in selecting a method for function 'show': Error in x@set[[i]] : subscript out of bounds

subset by filename should use $FIL keyword

Subsetting samples by filename in parseWorkspace() should use the $FIL keyword rather than the sample name tag in the xml file to get the filename. Currently subsetting by filename is broken for any workspace where the sample names don't use the FCS file name (i.e. the ITN data).

Need API to combine gating sets

We need an API to combine gating sets that share the same set of markers, like the data we read in from HVTN studies.

This would help solve the problem that normalization only works within a gating set. Combining gating sets would let us normalize more samples to a single reference.

Additionally, it could be useful to move samples from one gating set to another. For example, if I have a reference sample, I might like to add it to an empty gating set or to a gating set from a different xml file.

Either of these would be a workable solution. I guess one would be better in a low-memory environment than the other.

Keywords metadata not present in the gatingSet created by C++ parser

It seems that the keywords are not extracted from the FCS file and not stored in the metadata of the GatingHierarchy when creating a gating set using the C++ parser.

getKeywords(gs[[1]]) should return the keywords for the FCS file.
keyword(gs,"keyword") should return the value of "keyword" for all elements in the gating set.

The current R API uses the graph package. I guess this would have to be changed for the C++ parser. We'll need it anyway for the LabKey integration.

test C parser on different types of workspaces before merging to BioC

It is probably necessary to test the C parser on all the workspaces we have to make sure the compensations, transformations, gates and stats are correctly parsed.

It has only been tested on 3 workspaces from Stanford, Blomberg and ITN, which is far from sufficient as a robustness test.

(Maybe @gfinak has some of these data, since I believe the R version has been well tested with quite a lot of use cases.)

space issue caused by R environment

When I save a GatingSet R object (with just 2 samples) to an rda/rds file, it takes 50 MB, which is even bigger than the raw data itself. I then tried saving only the data environment that contains the ncdfFlowSet, and it still takes the same amount of space even though the ncdfFlowSet itself uses just 3.7 KB.

I ran ls on the data environment and found nothing there but the ncdfFlowSet, which really puzzles me.

I am not sure how to resolve this issue, because we have to use an environment to store one copy of the ncdfFlowSet shared by every sample (like multiple pointers referring to the same raw data).

more efficiently archive a subsetted GatingSet

Currently the archive method simply backs up the original cdf file and serializes the entire C data structure. This can be inefficient when archiving a subsetted GatingSet. For example, say G contains 400 samples: when we run archive(G[1:2],"backup.tar"), we actually expect to save the data for only the first two samples.

Saving a subsetted GatingSet more efficiently could be achieved by using the existing clone API:
G1<-clone(G[1:2])
archive(G1,"backup.tar")

The clone method does subset both the cdf and the C structure. However, it will lose the transformation objects, since the clone method doesn't copy transformations at the moment.

So eventually we do need to modify the current archive API to subset the cdf and the C structure before serializing them to disk.

extend gate coordinates of zero value(transformed scale)

flowJo sometimes generates gate coordinates at the boundary, which are really meant to include all the events below that boundary. There is no reliable way to tell during parsing whether we should extend such a gate. One solution is to add an argument "extend_zero" to parseWorkspace so that the user can control this extension behaviour.

Installation does not tolerate non-empty R session start-up script Rprofile.site

During installation there is an error:

g++ -m64 -shared -L/usr/local/lib64 -o flowWorkspace.so GatingHierarchy.o GatingSet.o R_GatingHierarchy.o R_GatingSet.o bitOps.o calibrationTable.o flowData.o flowJoWorkspace.o gate.o init.o macFlowJoWorkspace.o ncdfFlow.o nodeProperties.o spline.o transformation.o winFlowJoWorkspace.o workspace.o wsNode.o Welcome at Fri Sep 28 10:58:13 2012 -L/home/ldashevs/programs/R/library/Rcpp/lib -lRcpp -Wl,-rpath,/home/ldashevs/programs/R/library/Rcpp/lib Goodbye at Fri Sep 28 10:58:14 2012 -lxml2 -L/usr/local/lib -lnetcdf -L/usr/lib64 -lboost_serialization -L/home/ldashevs/programs/R/lib -lR
BiocInstaller version 1.5.12, ?biocLite for help
g++: Welcome: No such file or directory
g++: at: No such file or directory
g++: Fri: No such file or directory
g++: Sep: No such file or directory
g++: 28: No such file or directory
g++: 10:58:14: No such file or directory
g++: 2012: No such file or directory
g++: Goodbye: No such file or directory
g++: at: No such file or directory
g++: Fri: No such file or directory
g++: Sep: No such file or directory
g++: 28: No such file or directory
g++: 10:58:14: No such file or directory
g++: 2012: No such file or directory
make: *** [flowWorkspace.so] Error 1

As can be seen above, somehow, the output of the start up script Rprofile.site gets injected into one of the compilation commands' arguments...

issue of cleaning up cdf file in tmp folder

GatingSet creates the cdf file in a /tmp/xxx folder by default, and the subfolder "xxx" is normally deleted automatically when the R session finishes. However, if the R session is terminated abnormally, for example by the kill command in Linux, the temporary subfolder remains along with its cdf files.

For large data sets (>500 samples), the cdf file can easily exceed 7 GB, which is a potential problem for disk usage.
Today, for example, when I did several test runs for ITN QA, I quickly hit my disk quota and the program failed.

I am not sure how to address this issue yet, since we don't want to delete the cdf when one GatingSet is removed: there might be several GatingSet objects that point to the same cdf.

provide API to save new GatingSet object

Currently most of the data structures reside in C++; the R object only stores:

  1. the pointer to the C++ data structure
  2. the data environment that stores the ncdfFlowSet and axis.labels

So the serialization routine needs to be implemented in C++ to save the first part.
Since C++ does not natively support serialization of complex objects, we will consider using the Boost Serialization library to do the deep copying of the entire gating set class.

Parsing diva workspaces

I just wanted to open a request for parsing Diva workspaces. We should try to get a workspace to get a sense of how much work will be required.

parse correct sampleNames

Due to the issue, I plan to add an option to flowWorkspace so that the user can choose to parse the FCS filenames from either the keyword "$FIL" or the "name" attribute of the SampleNode.
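A toy sketch of such an option in base R. The sample table, its values, the function `match_samples` and the argument `file_name_from` are all hypothetical and only illustrate choosing the field that supplies the file name; they are not part of parseWorkspace.

```r
# Toy sample table (illustrative values): the workspace sample names differ
# from the FCS file names recorded in the $FIL keyword.
samples <- data.frame(
  name = c("Subject01_Day0", "Subject02_Day0"),
  FIL  = c("a_001.fcs", "a_002.fcs"),
  stringsAsFactors = FALSE
)

# Hypothetical option `file_name_from`: which field supplies the file name.
match_samples <- function(samples, files, file_name_from = c("FIL", "name")) {
  field <- match.arg(file_name_from)
  samples[samples[[field]] %in% files, , drop = FALSE]
}

nrow(match_samples(samples, "a_001.fcs", file_name_from = "FIL"))
nrow(match_samples(samples, "a_001.fcs", file_name_from = "name"))
```

With this table, subsetting by file name finds the sample only when the $FIL field is used, which mirrors the failure seen when sample names do not follow the FCS file names.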

the garbage collection issue on the external pointer

Problem description:
The memory for each gating hierarchy is allocated on the heap with "new" within C++.
Say G is a GatingSet object in R; it only contains the pointer to the equivalent gating set in C++.
When gh<-G[[x]] gets called, gating hierarchy "x" is exposed as an external pointer in the R object gh.
The problem is that when gh is removed in R, the pointer goes out of scope and gets garbage collected by R, even though the gating hierarchy that gh points to should live for the lifetime of gating set G (unless we allow the user to explicitly delete gh from G in future).

There are three different ways to address this issue:

  1. Within C++, let the gating set store map <string,gatingHierarchy> instead of map <string,gatingHierarchy *> to avoid allocating memory with "new", then try to expose them as external pointers to R to see whether they survive R garbage collection.
  2. Let G store the pointers to the gating hierarchies, so that these pointers do not go out of scope as long as G exists.
  3. Let the gh object store the sampleName and a copy of the pointer to the gating set instead of the gating hierarchy itself, thus hiding the pointers from R. This could be safer than 2, yet brings a little overhead from indexing the gating set by sampleName for each G[[x]] operation.

I will try 1 first and then go for 3.

@gfinak @raphg
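The ownership layout of the third option can be sketched in plain R, with an environment standing in for the C++ gating set. All names here (`new_toy_set`, `get_handle`, `resolve`) are hypothetical, and the sketch shows only the structure of the handle, not real external pointers or finalizers.

```r
# Toy stand-in for the C++ gating set: an environment with one tree per sample.
new_toy_set <- function(sample_names) {
  set <- new.env()
  set$trees <- setNames(
    lapply(sample_names, function(s) list(sample = s, nodes = "root")),
    sample_names
  )
  set
}

# Option 3: the handle keeps the sample name plus a reference to the whole
# set, never a pointer to the individual hierarchy.
get_handle <- function(set, sample_name) {
  stopifnot(sample_name %in% names(set$trees))
  list(set = set, sample = sample_name)
}

# Every access goes through the set, at the cost of one lookup by name.
resolve <- function(handle) handle$set$trees[[handle$sample]]

G  <- new_toy_set(c("a.fcs", "b.fcs"))
gh <- get_handle(G, "b.fcs")
resolved_sample <- resolve(gh)$sample

rm(gh)
invisible(gc())   # dropping the handle leaves the set untouched
names(G$trees)
```

Because the handle never owns a hierarchy, removing it cannot free memory that the set still needs.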

booleanGate support to "add" method

Add "complementFilter", "unionFilter" and "intersectFilter" support to the "add" method, converting them to the "BooleanGate" class in R before adding them to the C data structure.

Provide mapping of resolved workspace sample id -> file path to parseWorkspace()

I'd like to provide a mapping to parseWorkspace() of workspace samples to file paths that have been previously resolved by the user. Currently parseWorkspace() will search the file system starting at the directory named by the path argument for FCS files. However, the importer may have already resolved the files in a previous step.

remove number index from the getNode result

The getNode method currently appends an integer index to the pop name in order to make the pop name unique within the entire gating tree. It would be ideal to do this only when the pop name is duplicated. Pop names such as nonDebris and lymph don't have to be changed, since they are already unique.
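A base R sketch of the requested behaviour. The helper `label_nodes` and the "index.name" label format are assumptions made for illustration; the actual format used by getNode may differ.

```r
# Hypothetical helper: attach the node index only where a name is duplicated.
label_nodes <- function(pop_names) {
  is_dup <- duplicated(pop_names) | duplicated(pop_names, fromLast = TRUE)
  labels <- pop_names
  labels[is_dup] <- paste(which(is_dup), pop_names[is_dup], sep = ".")
  labels
}

label_nodes(c("nonDebris", "lymph", "CD4", "IL2+", "CD8", "IL2+"))
# returns "nonDebris" "lymph" "CD4" "4.IL2+" "CD8" "6.IL2+"
```

Unique names pass through unchanged, so only the two "IL2+" nodes receive an index.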

clone gatingset

The clone method for GatingSet currently calls the Boost serialization/de-serialization routines in order to make a full copy of the C++ data structure.

Originally the disk was used as the cache for serialization/de-serialization; this was later changed to std::stringstream as an in-memory cache, which proved to be orders of magnitude faster.

However, it turned out that stringstream doesn't work well with bigger datasets (245 samples): it failed at the memcpy stage. When I switched back to the disk cache, the issue was gone instantly.

non-cdf version of c++ parser

I am thinking about implementing a flowSet version of the C++ parser. There are two reasons why users might prefer this:
1. They avoid the hassle of installing the ncdf+hdf libraries.
2. They care more about speed than memory (if they have a large-memory computer).

And .Call is "pass by reference", right? So the overhead of passing flow data from R to C shouldn't be an issue.

@gfinak
@raphg

Bug in gating / transformation of HVTN data

Serious bug in flowworkspace gating of HVTN data

For workspace 080 batch 0939.xml, it appears that the transformation of the gates multiplies the coordinates by 64, when it should not. The data are in the lower left of the plot, while the gate is scaled up. Consequently downstream gates are empty. There are multiple workspaces with this issue. The one quoted here should reproduce the problem.

This is a high priority issue.

Data can be found in /shared/silo_researcher/Gottardo_R/gfinak_working/NormalizationData
