flowWorkspace

License: GNU Affero General Public License v3.0

R 79.64% C++ 19.94% JavaScript 0.23% C 0.19%

flowworkspace's Introduction

flowWorkspace: An infrastructure tool for hierarchical gated flow cytometry data.

This package is designed to store, query, and visualize hierarchical gated flow data. It also facilitates the comparison of automated gating methods against manual gating by importing basic flowJo workspaces into R and replicating the gating from flowJo using the flowCore functionality. Gating hierarchies, groups of samples, compensation, and transformation are handled so that the output matches the flowJo analysis.

Reporting Bugs or Issues

Use the issue template on GitHub when creating a new issue.
Follow the instructions in the template (do your background reading).
Search and verify that the issue hasn't already been addressed.
Check the Bioconductor support site.
Make sure your flow packages are up to date.
THEN if your issue persists, file a bug report.

Otherwise, we may close your issue without responding.

INSTALLATION

# First, install it from Bioconductor so that all dependent packages are pulled in automatically
if (!requireNamespace("BiocManager", quietly = TRUE)) install.packages("BiocManager")
BiocManager::install("flowWorkspace") # may be older
# Then, install the latest version from GitHub using the devtools package
install.packages("devtools")
library(devtools) # load it
install_github("RGLab/flowWorkspace")

Import flowJo workspace

library(flowWorkspace)
dataDir <- system.file("extdata", package="flowWorkspaceData")
wsfile <- list.files(dataDir, pattern="manual.xml",full=TRUE)
ws <- openWorkspace(wsfile);
gs <- parseWorkspace(ws, path = dataDir, name = 4, subset = "CytoTrol_CytoTrol_1.fcs")
gs

#get the first sample
gh <- gs[[1]]
#plot the hierarchy tree
plot(gh)
#show all the cell populations(/nodes)
getNodes(gh)
#show the population statistics
getPopStats(gh)
#plot the gates
plotGate(gh)

flowworkspace's Issues

add rbind method to ncdfFlow and flowWorkspace

It would be handy to have a combine method that takes a list of ncdfFlowSets/GatingSets instead of combining two at a time with rbind2. It would also resolve the disk space issue caused by repeatedly copying cdf files during multiple rbind2 calls.

plotting and naming of plotted gates

Write a new plotting routine for LabKey that:

For plots exported to LabKey

Plot multiple gates using the same dimensions and the same parent population on the same graph
Name the plot according to the parent population not the gated populations
Use the order of dimensions defined in the gate not as they appear in the flowFrame.

moving the inverse splinefun on range to c++ and expose it to R

add BooleanGate logic to C parser

Error while creating a GatingSet

Using
R: 2012-03-19 r58787

flowWorkspace:
branch: cpp_devel (70)

On Blackrhino navigate to the folder: "/loc/no-backup/mike/ITN029ST"

Fire up an R session and input the following commands:

suppressMessages( library(QUALIFIER) );
ws<-openWorkspace("QA_MFI_RBC_bounary_eventsV3.xml");
GT<-parseWorkspace(ws,execute=FALSE,useInternal=TRUE,subset=c(1));
2 ( <- choice in the interactive prompt)
gh_template<-GT[[1]];
fls <- list.files(pattern='*.fcs');
G<-GatingSet(gh_template,fls[1],path='.');

Now repeat the last command several times.
It takes me 4 times to get to the following error:

generating new GatingSet from the gating template...
copying transformation from gh_template...
Creating flowSet...

start to free GatingSet...

*** caught segfault ***
address 0x2500000024, cause 'memory not mapped'

Possible actions:
1: abort (with core dump, if enabled)
2: normal R exit
3: exit R without saving workspace
4: exit R saving workspace
Selection:

Issue confirmed by gfinak

check if there is potential memory leaking problem

It may be helpful to use Valgrind to detect potential memory leaks in the C++ code.

unarchive API

Hi, Mike
Could you look into changing the unarchive API to allow extraction of gating sets into arbitrary directories, rather than into the tmp directory? I think it would be generally very useful.

Thanks.
Greg

format axis number for plotGate

Originally we divided the transformed scale into 20 equal intervals (axis "at") and converted them back to the raw scale (axis "label") using the inverse transformation functions.
This made the plot axis awkward, because the raw-scale labels calculated from 20 logicle/log values were effectively arbitrary numbers.

Now we determine the raw-scale labels first, using the sequence 10^(2:n), and convert them to transformed values. The labels are also formatted as exponential expressions, which is more consistent with flowJo plots. See example plot.

A side effect is that we save the time spent on the inverse transformation, simply because we no longer need it.
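
The tick calculation described above can be sketched in base R as follows. This is only an illustration: the helper name axis_ticks is hypothetical, and log10 stands in for whatever transformation (logicle, log, etc.) the gating hierarchy actually carries.

```r
# Build axis ticks from raw-scale powers of ten (hypothetical helper).
# 'trans' maps raw values to the transformed (display) scale.
axis_ticks <- function(max_exp, trans) {
  exps <- 2:max_exp
  raw <- 10^exps                                   # raw-scale labels chosen first
  list(
    at = trans(raw),                               # tick positions on the transformed scale
    labels = parse(text = sprintf("10^%d", exps))  # plotmath expressions, drawn as exponents
  )
}

# log10 is an assumed stand-in for the real transformation.
ticks <- axis_ticks(5, trans = log10)
ticks$at
# With base graphics: axis(1, at = ticks$at, labels = ticks$labels)
```

Only the forward transformation is called, which is why the inverse transformation is no longer needed.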

implicit redundant FCS file parsing

During flowWorkspace parsing, there are currently two steps that involve FCS IO:

1. Construct the ncdfFlowSet with "read.ncdfFlowSet", which parses each FCS file with "read.FCS" and constructs:
A. flowSet (in-memory data structure)
B. cdf file (since the data must be written back to the cdf after compensation and transformation anyway, we keep the cdf empty at this stage to reduce the cost of disk IO)

2. Before the actual gating (compensation and transformation), "read.FCS" is called explicitly one more time in order to fetch the event data.

So both steps do the entire FCS parsing through "read.FCS", yet the first one only needs the metadata and the second only needs the data matrix.

I did try writing the data to the cdf in the first step and loading it from the cdf instead of the FCS file in the second step. Even though this removes the overhead of parsing the FCS header, the extra writing cost offset the gains, according to testing on 6 samples from Stanford.

So I may try to hack into read.FCS to do only the header parsing in the first step and only the data matrix reading in the second step. Hopefully that will give us some extra speed.

extract compensation and transformations from pc flowJo

memory efficiency of getData method

Right now it remains the original getData method written for the R parser, which subsets the raw data and returns a flowSet. It can be memory-expensive to call getData on upstream gates such as singlet or lymph, which contain many events.

We may want to overload the original method to return an ncdfFlowSet instead.

1d densityplot support for plotGate

After the 1-D gate display issue for rare populations was fixed in flowViz::densityplot, it will be helpful to add this option to plotGate methods:
The new argument would be "type=c("xyplot","densityplot")", which allows user to optionally visualize gates in 1-d density curve.

change the population indices to increase space efficiency

Right now the indices are stored as global bit vectors for fast indexing of the raw data. This gives good performance for logical operations such as unions and intersections compared to integer indices, but costs more space. For instance, an FCS file of 250k events is about 15 MB.
With N = 250,000 events, 170 gates end up with indices of N*170/8/10^6, or about 5.3 MB, which is roughly 1/3 of the raw data size.

We have the following solutions based on the discussions so far:
1. Compress the bit vector (run-length encoding is one compression technique we can use).

2. Use global integer indices. If most of the gates are rare populations (like cytokine gates), this could be more efficient, even though an integer takes 32 times the space of a bit.

3. Use local indices instead of global ones (suggested by Raphael ages ago). They can be either bit or integer vectors.

The 1st and 3rd would certainly ease the space problem, at the cost of some extra computation time (either sorting or converting local indices to global).

A modified version of the 2nd could be promising too: we could mix the two types of indices in one gating tree, with each gate storing its indices as either a bit or an integer vector, whichever is smaller.
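
The arithmetic behind the mixed approach can be sketched in base R. The function names below are hypothetical and the sizes ignore any per-vector overhead. The sketch assumes one bit per event for a global bit vector and one 32-bit integer per event inside the gate for integer indices.

```r
# Bytes needed to index one gate under each representation.
bit_bytes <- function(n_events) ceiling(n_events / 8)  # 1 bit per event in the file
int_bytes <- function(n_in_gate) 4 * n_in_gate          # 4 bytes per event inside the gate

n_events <- 250000
n_gates <- 170

# All gates stored as global bit vectors.
total_bit <- n_gates * bit_bytes(n_events)
total_bit / 1e6  # megabytes

# Mixed strategy: per gate, keep whichever representation is smaller.
choose_index_type <- function(n_in_gate, n_events) {
  if (int_bytes(n_in_gate) < bit_bytes(n_events)) "integer" else "bit"
}

choose_index_type(500, n_events)     # rare population: 2000 bytes vs 31250 bytes
choose_index_type(100000, n_events)  # large population: 400000 bytes vs 31250 bytes
```

Under these assumptions the break-even point is a gate holding fewer than 1/32 of the events, which is where the factor of 32 mentioned above comes from.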

transformation parsing issue

The C parser failed to find the correct calibration tables in the NHLBI workspace because multiple sets of calibration tables exist in the xml.

The R parser seems to do the job. I need to troubleshoot and fix it.

potential risk of segfault due to the empty external pointer in GatingSet object

When splitting data into subjects and running flowAP on each subject in parallel with snow, the GatingSet objects collected from each computing node have a nil external pointer. Apparently this is because R does not take care of cloning the C data structure.

I can bypass this issue by using one GatingSet and splitting the jobs at the gate level, but it is still a serious issue to have an empty external pointer in a GatingSet R object, since it easily crashes R.
Because the external pointer is not a simple NULL pointer when passed from R to C, there is no way to do the validity check on the C side.
I am not sure how to do the validity check in R either, since the simple comparison ptr=="<pointer: (nil)>" doesn't work.
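
One possible R-side check, offered as a sketch rather than a tested fix for flowWorkspace: compare the pointer against a freshly created nil external pointer with identical(). The helper name is hypothetical, and the sketch assumes the pointer carries no attributes (identical() also compares attributes).

```r
library(methods)  # provides new(); attached by default in most R sessions

# TRUE when 'ptr' is a bare external pointer whose address is nil.
# new("externalptr") creates a fresh nil external pointer to compare against.
is_nil_externalptr <- function(ptr) {
  stopifnot(typeof(ptr) == "externalptr")
  identical(ptr, new("externalptr"))
}

p <- new("externalptr")
is_nil_externalptr(p)

# Serialization never preserves the address, which is what happens when an
# object holding an external pointer is shipped back from a worker node.
roundtrip <- unserialize(serialize(p, NULL))
is_nil_externalptr(roundtrip)
```

A check like this could run before any call that hands the pointer to C, so that an error is raised instead of a segfault.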

export gating strategy from flowWorkspace

Other than saving the entire analysis (GatingSet) mentioned in Issue #11, we would also like to provide an API to export just the gating strategy (or gating template) from a sample-specific GatingHierarchy. Gating-ML would be a good format for storing such a gating strategy.

I just browsed the Gating-ML specification. My impressions are:

1. It is a flat-structured schema that defines the individual components: gates, transformations, and compensations.
2. The gate component also has an optional parent_id attribute, which refers to another gate, so the gating strategy can be represented by these parent/child relationships.
3. An analysis-oriented hierarchical structure is not part of the specification.

Secondly,
Windows flowJo workspaces (v7.6.4) already use Gating-ML to define nodes such as gates, transformations, and compensations, but the Mac version (9.4.2) does not yet.

Thirdly,
the flowUtils package is documented as the tool for parsing individual Gating-ML components into R (flowCore objects). Since there is no gating template class defined in flowCore, there is currently no functionality to reconstruct the hierarchical gating strategy.
It is worth noting that flowUtils does have undocumented and unexposed (yet quite sophisticated) routines to convert a workflow to a flowJo workspace (which presumably complies with Gating-ML). I will have to test it to see how it works before we decide to leverage it.

count discrepancy between flowJo and flowCore for HVTN 080

need to be investigated and fixed

port flowWorkspace2flowCore to c++ parser

get Children and parent method

There is an issue with converting between VertexID and node index, which causes incorrect results to be returned, particularly for node 6 "True T-Regs". It needs to be debugged and fixed.

getTransformation in c++ parser crashes R

subsetting of gating set

The method "ncFlowSet" with signature(x="GatingSet") does not subset the ncdfFlowSet/flowSet according to the "[" operation on the GatingSet, because it was originally designed (as the deprecated getNcdf function) for extracting the back-end cdf data repository.

Since we now expect it to return the actual subsetted data, we need to subset it and return a subview of the data, instead of the entire dataset, according to the GatingSet (e.g. G[1:n]).

Hopefully the change won't break any other packages.

Error subsetting gating sets

An arbitrary example:

If G is a gating set of size 20 and I call:

G[1:10]
associate the ncdfFlowSet to tree structure...

without assigning the result anywhere..
then

G[1:20]
Error: Subset out of bounds

The subset operation seems to modify the existing gating set somewhere... and later subset operations on the original gating set are broken..

extend the current Viz API:plotGate

The plotGate method currently has signature(x="GatingHierarchy",y="numeric") or signature(x="GatingHierarchy",y="character") to plot 1 gate vs 1 FCS.

We could add signature(x="GatingHierarchy") without a population index (y), or allow y to be a vector, to plot all or some sequential gates on one page per FCS.

We could also add signature(x="GatingSet",y="numeric") to this method, which does a lattice plot of 1 gate vs n FCSs,
and signature(x="GatingSet") for n gates vs n FCSs.

document APIs for C parser

performance testing

1.evaluate the speed without "read.ncdfFlow" part
2.test on bigger dataset (ITN data)

parsing gating template without the actual FCS presence

Parsing 1 samples
Parsing sampleID 74014
Removing 1 samples from the analysis since we can't find their FCS files.

GT
A GatingSet with 0 samples
1 . Error in show(object[[i]]) :
error in evaluating the argument 'object' in selecting a method for function 'show': Error in x@set[[i]] : subscript out of bounds

add R API for internal structure of gating hierarchy

1.change the configure file to make the package compile with the new added cpp files
2.write example R code to interact with c structure

test and make c++ parser compatible with normalization in flowStats

Once this is done, we can roll out the new flowWorkspace into BioC devel

subset by filename should use $FIL keyword

Subsetting samples by filename in parseWorkspace() should use the $FIL keyword rather than the sample name tag in the xml file to get the filename. Currently subsetting by filename is broken for any workspace where the sample names don't use the FCS file name (i.e. the ITN data).

flowWorkspace is not correctly detecting the workspace version for one of the HVTN's data files

This results in the gate coordinates not being divided by 64, as they should be for older workspaces. Need to fix this and find a robust way to detect the flowJo version that generated the workspace.

Apply the QA gating template provided by the ITN

The gating template provides the coordinates for the gates, so flowWorkspace reads them and applies them to the other FCS files.

Need API to combine gating sets

We need an api to combine gating sets that share the same set of markers, i.e. like the data we read in from HVTN studies.

This would help solve the problem that normalization only works within a gating set. Combining gating sets would let us normalize more samples to a single reference.

Additionally, I guess it could be useful to move samples around from one gating set to another.. i.e. if I have a reference sample, perhaps I'd like to add it to an empty gating set or to a gating set from a different xml file.

Either of these would be workable solutions.. I guess one would be better in a low memory environment than the other.

Keywords metadata not present in the gatingSet created by C++ parser

It seems that the keywords are not extracted from the FCS file and not stored in the metadata of the GatingHierarchy when creating a gating set using the c++ parser.

getKeywords(gs[[1]]) should return the keywords for the FCS file.
keyword(gs,"keyword") should return the value of "keyword" for all elements in the gating set.

The current R api uses the graph package. I guess this would have to be changed for the C++ parser. We'll need it anyway for the Labkey integration.

test C parser on different type of workspaces before merging to BioC

It is probably necessary to test the C parser on all the workspaces we have, to make sure the compensations, transformations, gates, and stats are parsed correctly.

It has only been tested on 3 workspaces from Stanford, Blomberg, and ITN, which is far from sufficient as a robustness test.

(Maybe @gfinak would have some of these data since R version has been well tested with quite a lot of use cases, I believe)

space issue caused by R environment

When I save a GatingSet R object (with just 2 samples) to an rda/rds file, it takes 50M, which is even bigger than the raw data itself. I then tried to save only the data environment that contains the ncdfFlowSet, and it still takes the same amount of space even though the ncdfFlowSet itself uses just 3.7k.

I ran ls on the data environment and found nothing there but the ncdfFlowSet, which is really puzzling.

I am not sure how to resolve this issue, because we have to use an environment to store one copy of the ncdfFlowSet shared by every sample (like multiple pointers referring to the same raw data).

more efficiently archive a subsetted GatingSet

Currently the archive method simply backs up the original cdf file and serializes the entire C data structure. This can be inefficient when archiving a subsetted GatingSet. For example, say G contains 400 samples: when we run archive(G[1:2],"backup.tar"), we actually expect to save the data for the first two samples only.

Saving a subsetted GatingSet more efficiently could be achieved by using the existing clone API:
G1<-clone(G[1:2])
archive(G1,"backup.tar")

The clone method does subset both the cdf and the C structure. However, it loses the transformation objects, since clone doesn't copy transformations at the moment.

So eventually we do need to modify the current archive API to subset the cdf and the C structure before serializing them to disk.

extend gate coordinates of zero value(transformed scale)

flowJo sometimes generates gate coordinates at the boundary, which are really meant to include all the events below that boundary. There is no reliable way to tell during parsing whether we should extend the gate. One solution is to add an argument "extend_zero" to parseWorkspace so that the user controls this extension behaviour.

Installation does not tolerate non-empty R session start-up script Rprofile.site

During installation there is an error:

g++ -m64 -shared -L/usr/local/lib64 -o flowWorkspace.so GatingHierarchy.o GatingSet.o R_GatingHierarchy.o R_GatingSet.o bitOps.o calibrationTable.o flowData.o flowJoWorkspace.o gate.o init.o macFlowJoWorkspace.o ncdfFlow.o nodeProperties.o spline.o transformation.o winFlowJoWorkspace.o workspace.o wsNode.o Welcome at Fri Sep 28 10:58:13 2012 -L/home/ldashevs/programs/R/library/Rcpp/lib -lRcpp -Wl,-rpath,/home/ldashevs/programs/R/library/Rcpp/lib Goodbye at Fri Sep 28 10:58:14 2012 -lxml2 -L/usr/local/lib -lnetcdf -L/usr/lib64 -lboost_serialization -L/home/ldashevs/programs/R/lib -lR
BiocInstaller version 1.5.12, ?biocLite for help
g++: Welcome: No such file or directory
g++: at: No such file or directory
g++: Fri: No such file or directory
g++: Sep: No such file or directory
g++: 28: No such file or directory
g++: 10:58:14: No such file or directory
g++: 2012: No such file or directory
g++: Goodbye: No such file or directory
g++: at: No such file or directory
g++: Fri: No such file or directory
g++: Sep: No such file or directory
g++: 28: No such file or directory
g++: 10:58:14: No such file or directory
g++: 2012: No such file or directory
make: *** [flowWorkspace.so] Error 1

As can be seen above, somehow, the output of the start up script Rprofile.site gets injected into one of the compilation commands' arguments...

issue of cleaning up cdf file in tmp folder

GatingSet creates its cdf file in a /tmp/xxx folder by default, and the subfolder "xxx" is normally deleted automatically when the R session finishes. However, if the R session is terminated abnormally, for example by the kill command on Linux, the temporary subfolder remains along with its cdf files.

For a large data set (more than 500 samples, say), the cdf file can easily exceed 7G, which is a potential disk usage problem.
Today, for example, when I did several test runs for ITN QA, my disk quota was reached very quickly and the program failed.

I am not sure how to address this issue yet, since we don't want to delete the cdf when one GatingSet is removed: there might be several GatingSet objects that point to the same cdf.

provide API to save new GatingSet object

Currently most of the data structures reside in C++. The R object only stores:

1. the pointer to the C++ data structure
2. the data environment that stores the ncdfFlowSet and axis.labels

So the serialization routine needs to be implemented in C++ to save the first part.
Since C++ does not natively support serialization of complex objects, we will consider using the Boost Serialization library to do the deep copying of the entire gating set class.

Parsing diva workspaces

I just wanted to open a request for parsing Diva workspaces. We should try to get a workspace to get a sense of how much work will be required.

move compensation part from R to c++ to further speed up

parse correct sampleNames

Due to the issue, I plan to add an option to flowWorkspace so that the user can choose to parse the correct FCS filenames from either the "$FIL" keyword or the "name" attribute of the SampleNode.

the garbage collection issue on the external pointer

Problem description:
The memory for each gating hierarchy is allocated on the heap with "new" in C++.
Say G is a GatingSet object in R. It only contains the pointer to the equivalent gating set in C++.
When gh<-G[[x]] is called, gating hierarchy "x" is exposed as an external pointer in the R object gh.
The problem is that when gh is removed in R, the pointer also goes out of scope and gets garbage collected by R, even though the gating hierarchy that gh points to should live for the lifetime of the GatingSet G (unless we allow the user to explicitly delete gh from G in the future).

There are three different ways to address this issue:

1. Within C++, let the gating set store map <string,gatingHierarchy> instead of map <string,gatingHierarchy *> to avoid allocating memory with "new", then try to expose them as external pointers to R to see if they can survive R garbage collection.

2. Let G store the pointers to the gating hierarchies, so that these pointers do not go out of scope as long as G exists.

3. Let the gh object store the sampleName and a copy of the pointer to the gating set instead of the gating hierarchy itself, thus hiding the hierarchy pointers from R. This could be safer than 2, but brings a little overhead from indexing the gating set by sampleName on each G[[x]] operation.

I will try 1 first and then go for 3.
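
Option 3 can be illustrated with a small base R model. This is not the package's implementation: an R environment stands in for the C++ gating set (both behave as references), and all function names are hypothetical.

```r
# Stand-in for the C++ gating set: an environment is a reference object,
# so copying it behaves like copying a pointer.
new_gating_set <- function(sample_names) {
  gs <- new.env()
  gs$hierarchies <- setNames(
    lapply(sample_names, function(s) list(sample = s)),
    sample_names
  )
  gs
}

# Option 3: the handle stores only the sample name and a reference to the set,
# never a pointer to the hierarchy itself.
get_hierarchy_handle <- function(gs, sample_name) {
  stopifnot(sample_name %in% names(gs$hierarchies))
  list(sample_name = sample_name, gating_set = gs)
}

# Every access looks the hierarchy up by name; this is the indexing overhead.
resolve <- function(handle) {
  handle$gating_set$hierarchies[[handle$sample_name]]
}

G <- new_gating_set(c("a.fcs", "b.fcs"))
gh <- get_hierarchy_handle(G, "a.fcs")
resolve(gh)$sample

# Dropping the handle leaves the set and its hierarchies untouched.
rm(gh)
invisible(gc())
names(G$hierarchies)
```

In this model the handle never owns the hierarchy, so removing it cannot free memory that the set still needs.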

@gfinak @raphg

booleanGate support to "add" method

add "complementFilter","unionFilter","intersectFilter" support to "add" method, converting them to "BooleanGate" class in R before adding them to C data structure.

Provide mapping of resolved workspace sample id -> file path to parseWorkspace()

I'd like to provide a mapping to parseWorkspace() of workspace samples to file paths that have been previously resolved by the user. Currently parseWorkspace() will search the file system starting at the directory named by the path argument for FCS files. However, the importer may have already resolved the files in a previous step.

remove number index from the getNode result

The getNode method currently appends an integer index to each pop name in order to make the name unique within the entire gating tree. It would be ideal to do this only when the pop name is duplicated. Pop names such as nonDebris and lymph don't have to be changed, since they are already unique.
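
The requested behaviour could look like the following base R sketch. The helper name and the "index.name" format are assumptions made for illustration, not the package's actual naming scheme.

```r
# Prefix the node index only for population names that occur more than once.
# (Hypothetical helper; the "index.name" format is assumed.)
uniquify_pop_names <- function(pop_names) {
  is_dup <- pop_names %in% pop_names[duplicated(pop_names)]
  ifelse(is_dup, paste(seq_along(pop_names), pop_names, sep = "."), pop_names)
}

uniquify_pop_names(c("nonDebris", "lymph", "CD4", "CD8", "CD4"))
# "nonDebris" "lymph" "3.CD4" "CD8" "5.CD4"
```

Names that are already unique pass through unchanged, so existing code that looks up nonDebris or lymph by name keeps working.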

clone gatingset

The clone method for GatingSet currently calls the Boost serialization/de-serialization routines in order to get a full copy of the C++ data structure.

Originally the disk was used as the cache for serialization/de-serialization. This was later changed to std::stringstream as an in-memory cache, which proved to be orders of magnitude faster.

However, it turned out that stringstream doesn't work well with a bigger dataset (245 samples), which failed at the memcpy stage. When I switched back to the disk cache, the issue was gone instantly.

non-cdf version of c++ parser

I am thinking about implementing a flowSet version of the C++ parser. There are two reasons users might prefer this:
1. to avoid the hassle of installing the ncdf+hdf libraries
2. they care more about speed than memory (if they have a large-memory computer)

And .Call is "pass by reference", right? So the overhead of passing flow data from R to C shouldn't be an issue.

@gfinak
@raphg

Bug in gating / transformation of HVTN data

Serious bug in flowworkspace gating of HVTN data

For workspace 080 batch 0939.xml, it appears that the transformation of the gates multiplies the coordinates by 64, when it should not. The data are in the lower left of the plot, while the gate is scaled up. Consequently downstream gates are empty. There are multiple workspaces with this issue. The one quoted here should reproduce the problem.

This is a high priority issue.

Data can be found in /shared/silo_researcher/Gottardo_R/gfinak_working/NormalizationData