kr-colab / filet

Software for detecting introgression using supervised machine learning

License: GNU General Public License v3.0

Makefile 0.35% Python 12.08% Shell 3.50% Scheme 9.43% C 74.64%

filet's People

flag0010 dschride svitlanalukicheva wangjie07070910

filet's Issues

timeout or memory leak warning

Hi @andrewkern, @dschride,
I have been running into a warning when running FILET training. Is it anything to be worried about?
thanks,
@stsmall
Warning:
training set size after balancing: 270003
Checking accuracy when distinguishing among all 3 classes
Using extraTreesClassifier
/anaconda3/lib/python3.7/site-packages/joblib/externals/loky/process_executor.py:691: UserWarning: A worker stopped while some jobs were given to the executor. This can be caused by a too short worker timeout or by a memory leak.
"timeout or by a memory leak.", UserWarning
GridSearchCV took 10745.41 seconds for 24 candidate parameter settings.
Results for extraTreesClassifier

Index Error on Classify with different probThreshold

Hi @andrewkern, @dschride,
I am getting an odd and intermittent error on Classify.

Traceback (most recent call last):
File "classifyChromosome.py", line 64, in <module>
writePreds(predictions, probs, coords, outFileName)
File "classifyChromosome.py", line 31, in writePreds
outLine = coords[i] + [predictions[i]] + list(probs[i])
IndexError: index 4981 is out of bounds for axis 0 with size 4981

Previously, when I hit this error, I rebuilt the training set and retrained, and classification then completed without error. Recently I rebuilt the classifier three times and still can't classify without errors. Oddly, when I lower the probThreshold (e.g., to 0.05 from the 0.10 I was using), it finishes without error.
Any suggestions?
thanks,
@stsmall
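
One way to narrow this down is to check, before writing, that the three per-window sequences indexed in `writePreds` have the same length. The wording of the message ("axis 0 with size 4981") is NumPy's, which suggests that `predictions` or `probs` has 4981 rows while the loop runs over more entries than that. The sketch below is a minimal illustration under that assumption; `check_aligned` is a hypothetical helper, not part of FILET, and the toy data only stands in for the real arrays.

```python
def check_aligned(coords, predictions, probs):
    """Raise a descriptive error if the three per-window sequences differ in length."""
    lengths = {
        "coords": len(coords),
        "predictions": len(predictions),
        "probs": len(probs),
    }
    if len(set(lengths.values())) != 1:
        raise ValueError("per-window inputs differ in length: %r" % (lengths,))
    return lengths["coords"]


# Toy data (hypothetical): three windows, but only two predictions.
coords = [["chr1", 1, 1000], ["chr1", 1001, 2000], ["chr1", 2001, 3000]]
predictions = [0, 2]
probs = [[0.9, 0.05, 0.05], [0.1, 0.1, 0.8]]

try:
    check_aligned(coords, predictions, probs)
    mismatch = None
except ValueError as err:
    mismatch = str(err)

print(mismatch)
```

Calling such a check just before the write loop would turn the intermittent `IndexError` into a message that says which input is short, which may help explain why the result depends on probThreshold.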

Error on training with example dataset

Hi @andrewkern, @dschride
I followed the example that ships with FILET but ran into an error during the training step: trainFiletClassifier.py stops with the traceback below.

Any help or suggestions are greatly appreciated!
thanks,
@stsmall

python 2.7 (anaconda version)
scipy v1.0.1
numpy v1.13.3
sklearn v0.19.2

anaconda2/lib/python2.7/site-packages/sklearn/cross_validation.py:41: DeprecationWarning: This module was deprecated in version 0.18 in favor of the model_selection module into which all the refactored classes and functions are moved. Also note that the interface of the new CV iterators are different from that of this module. This module will be removed in 0.20.
"This module will be removed in 0.20.", DeprecationWarning)
anaconda2/lib/python2.7/site-packages/sklearn/grid_search.py:42: DeprecationWarning: This module was deprecated in version 0.18 in favor of the model_selection module into which all the refactored classes and functions are moved. This module will be removed in 0.20.
DeprecationWarning)
training set size after balancing: 29940
Checking accuracy when distinguishing among all 3 classes
Using extraTreesClassifier
Traceback (most recent call last):
File "trainFiletClassifier.py", line 81, in <module>
grid_search.fit(X, y)
File "anaconda2/lib/python2.7/site-packages/sklearn/grid_search.py", line 838, in fit
return self._fit(X, y, ParameterGrid(self.param_grid))
File "anaconda2/lib/python2.7/site-packages/sklearn/grid_search.py", line 574, in _fit
for parameters in parameter_iterable
File "anaconda2/lib/python2.7/site-packages/sklearn/externals/joblib/parallel.py", line 789, in __call__
self.retrieve()
File "anaconda2/lib/python2.7/site-packages/sklearn/externals/joblib/parallel.py", line 740, in retrieve
raise exception
sklearn.externals.joblib.my_exceptions.JoblibValueError: JoblibValueError

Multiprocessing exception:
...........................................................................
FILET/trainFiletClassifier.py in ()
76 clf, mlType, paramGrid = ExtraTreesClassifier(n_estimators=100, random_state=0), "extraTreesClassifier", param_grid_forest
77
78 sys.stderr.write("Using %s\n" %(mlType))
79 grid_search = GridSearchCV(clf,param_grid=param_grid_forest,cv=10,n_jobs=10)
80 start = time()
---> 81 grid_search.fit(X, y)
82 sys.stderr.write("GridSearchCV took %.2f seconds for %d candidate parameter settings.\n"
83 % (time() - start, len(grid_search.grid_scores_)))
84 print "Results for %s" %(mlType)
85 report(grid_search.grid_scores_)

...........................................................................
anaconda2/lib/python2.7/site-packages/sklearn/grid_search.py in fit(self=GridSearchCV(cv=10, error_score='raise',
...='2n_jobs', refit=True, scoring=None, verbose=0), X=array([[ 6.53900000e-03, 1.00000000e-06, 3....543860e+02, 1.00000000e+00, 1.00000000e+00]]), y=['0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', ...])
833 y : array-like, shape = [n_samples] or [n_samples, n_output], optional
834 Target relative to X for classification or regression;
835 None for unsupervised learning.
836
837 """
--> 838 return self._fit(X, y, ParameterGrid(self.param_grid))
self._fit = <bound method GridSearchCV._fit of GridSearchCV(...'2*n_jobs', refit=True, scoring=None, verbose=0)>
X = array([[ 6.53900000e-03, 1.00000000e-06, 3....543860e+02, 1.00000000e+00, 1.00000000e+00]])
y = ['0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', ...]
self.param_grid = {'bootstrap': [True, False], 'criterion': ['gini', 'entropy'], 'max_depth': [3, 10, None], 'max_features': [1, 3, 4, 22], 'min_samples_leaf': [1, 3, 10], 'min_samples_split': [1, 3, 10]}
839
840
841 class RandomizedSearchCV(BaseSearchCV):
842 """Randomized search on hyper parameters.

...........................................................................
anaconda2/lib/python2.7/site-packages/sklearn/grid_search.py in _fit(self=GridSearchCV(cv=10, error_score='raise',
...='2*n_jobs', refit=True, scoring=None, verbose=0), X=array([[ 6.53900000e-03, 1.00000000e-06, 3....543860e+02, 1.00000000e+00, 1.00000000e+00]]), y=['0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', ...], parameter_iterable=<sklearn.grid_search.ParameterGrid object>)
569 )(
570 delayed(fit_and_score)(clone(base_estimator), X, y, self.scorer,
571 train, test, self.verbose, parameters,
572 self.fit_params, return_parameters=True,
573 error_score=self.error_score)
--> 574 for parameters in parameter_iterable
parameters = undefined
parameter_iterable = <sklearn.grid_search.ParameterGrid object>
575 for train, test in cv)
576
577 # Out is a list of triplet: score, estimator, n_test_samples
578 n_fits = len(out)

...........................................................................
anaconda2/lib/python2.7/site-packages/sklearn/externals/joblib/parallel.py in __call__(self=Parallel(n_jobs=10), iterable=<generator object >)
784 if pre_dispatch == "all" or n_jobs == 1:
785 # The iterable was consumed all at once by the above for loop.
786 # No need to wait for async callbacks to trigger to
787 # consumption.
788 self._iterating = False
--> 789 self.retrieve()
self.retrieve = <bound method Parallel.retrieve of Parallel(n_jobs=10)>
790 # Make sure that we get a last message telling us we are done
791 elapsed_time = time.time() - self._start_time
792 self._print('Done %3i out of %3i | elapsed: %s finished',
793 (len(self._output), len(self._output),

Sub-process traceback:

ValueError Thu Aug 23 12:04:10 2018
PID: 51133 Python 2.7.14: anaconda2/bin/python
...........................................................................
anaconda2/lib/python2.7/site-packages/sklearn/externals/joblib/parallel.py in __call__(self=<sklearn.externals.joblib.parallel.BatchedCalls object>)
126 def __init__(self, iterator_slice):
127 self.items = list(iterator_slice)
128 self._size = len(self.items)
129
130 def __call__(self):
--> 131 return [func(*args, **kwargs) for func, args, kwargs in self.items]
func =
args = (ExtraTreesClassifier(bootstrap=True, class_weigh...lse, random_state=0, verbose=0, warm_start=False), memmap([[ 6.53900000e-03, 1.00000000e-06, 3...543860e+02, 1.00000000e+00, 1.00000000e+00]]), ['0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', ...], , array([ 998, 999, 1000, ..., 29937, 29938, 29939]), array([ 0, 1, 2, ..., 20955, 20956, 20957]), 0, {'bootstrap': True, 'criterion': 'gini', 'max_depth': 3, 'max_features': 1, 'min_samples_leaf': 1, 'min_samples_split': 1}, {})
kwargs = {'error_score': 'raise', 'return_parameters': True}
self.items = [(, (ExtraTreesClassifier(bootstrap=True, class_weigh...lse, random_state=0, verbose=0, warm_start=False), memmap([[ 6.53900000e-03, 1.00000000e-06, 3...543860e+02, 1.00000000e+00, 1.00000000e+00]]), ['0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', ...], , array([ 998, 999, 1000, ..., 29937, 29938, 29939]), array([ 0, 1, 2, ..., 20955, 20956, 20957]), 0, {'bootstrap': True, 'criterion': 'gini', 'max_depth': 3, 'max_features': 1, 'min_samples_leaf': 1, 'min_samples_split': 1}, {}), {'error_score': 'raise', 'return_parameters': True})]
132
133 def __len__(self):
134 return self._size
135

...........................................................................
anaconda2/lib/python2.7/site-packages/sklearn/cross_validation.py in _fit_and_score(estimator=ExtraTreesClassifier(bootstrap=True, class_weigh...lse, random_state=0, verbose=0, warm_start=False), X=memmap([[ 6.53900000e-03, 1.00000000e-06, 3...543860e+02, 1.00000000e+00, 1.00000000e+00]]), y=['0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', ...], scorer=, train=array([ 998, 999, 1000, ..., 29937, 29938, 29939]), test=array([ 0, 1, 2, ..., 20955, 20956, 20957]), verbose=0, parameters={'bootstrap': True, 'criterion': 'gini', 'max_depth': 3, 'max_features': 1, 'min_samples_leaf': 1, 'min_samples_split': 1}, fit_params={}, return_train_score=False, return_parameters=True, error_score='raise')
1670
1671 try:
1672 if y_train is None:
1673 estimator.fit(X_train, **fit_params)
1674 else:
-> 1675 estimator.fit(X_train, y_train, **fit_params)
estimator.fit = <bound method ExtraTreesClassifier.fit of ExtraT...se, random_state=0, verbose=0, warm_start=False)>
X_train = memmap([[ 1.05620000e-02, 1.00000000e-06, 5...543860e+02, 1.00000000e+00, 1.00000000e+00]])
y_train = ['0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', ...]
fit_params = {}
1676
1677 except Exception as e:
1678 if error_score == 'raise':
1679 raise

...........................................................................
anaconda2/lib/python2.7/site-packages/sklearn/ensemble/forest.py in fit(self=ExtraTreesClassifier(bootstrap=True, class_weigh...lse, random_state=0, verbose=0, warm_start=False), X=array([[ 1.05619999e-02, 9.99999997e-07, 5.....00000000e+00, 1.00000000e+00]], dtype=float32), y=array([[ 0.],
[ 0.],
[ 0.],
...,
[ 2.],
[ 2.],
[ 2.]]), sample_weight=None)
323 trees = Parallel(n_jobs=self.n_jobs, verbose=self.verbose,
324 backend="threading")(
325 delayed(parallel_build_trees)(
326 t, self, X, y, sample_weight, i, len(trees),
327 verbose=self.verbose, class_weight=self.class_weight)
--> 328 for i, t in enumerate(trees))
i = 99
329
330 # Collect newly grown trees
331 self.estimators.extend(trees)
332

...........................................................................
anaconda2/lib/python2.7/site-packages/sklearn/externals/joblib/parallel.py in __call__(self=Parallel(n_jobs=1), iterable=<generator object >)
774 self.n_completed_tasks = 0
775 try:
776 # Only set self._iterating to True if at least a batch
777 # was dispatched. In particular this covers the edge
778 # case of Parallel used with an exhausted iterator.
--> 779 while self.dispatch_one_batch(iterator):
self.dispatch_one_batch = <bound method Parallel.dispatch_one_batch of Parallel(n_jobs=1)>
iterator = <generator object >
780 self._iterating = True
781 else:
782 self._iterating = False
783

...........................................................................
anaconda2/lib/python2.7/site-packages/sklearn/externals/joblib/parallel.py in dispatch_one_batch(self=Parallel(n_jobs=1), iterator=<generator object >)
620 tasks = BatchedCalls(itertools.islice(iterator, batch_size))
621 if len(tasks) == 0:
622 # No more tasks available in the iterator: tell caller to stop.
623 return False
624 else:
--> 625 self._dispatch(tasks)
self._dispatch = <bound method Parallel._dispatch of Parallel(n_jobs=1)>
tasks = <sklearn.externals.joblib.parallel.BatchedCalls object>
626 return True
627
628 def _print(self, msg, msg_args):
629 """Display the message on stout or stderr depending on verbosity"""

...........................................................................
anaconda2/lib/python2.7/site-packages/sklearn/externals/joblib/parallel.py in _dispatch(self=Parallel(n_jobs=1), batch=<sklearn.externals.joblib.parallel.BatchedCalls object>)
583 self.n_dispatched_tasks += len(batch)
584 self.n_dispatched_batches += 1
585
586 dispatch_timestamp = time.time()
587 cb = BatchCompletionCallBack(dispatch_timestamp, len(batch), self)
--> 588 job = self._backend.apply_async(batch, callback=cb)
job = undefined
self._backend.apply_async = <bound method SequentialBackend.apply_async of <...lib._parallel_backends.SequentialBackend object>>
batch = <sklearn.externals.joblib.parallel.BatchedCalls object>
cb = <sklearn.externals.joblib.parallel.BatchCompletionCallBack object>
589 self._jobs.append(job)
590
591 def dispatch_next(self):
592 """Dispatch more data for parallel processing

...........................................................................
anaconda2/lib/python2.7/site-packages/sklearn/externals/joblib/_parallel_backends.py in apply_async(self=<sklearn.externals.joblib._parallel_backends.SequentialBackend object>, func=<sklearn.externals.joblib.parallel.BatchedCalls object>, callback=<sklearn.externals.joblib.parallel.BatchCompletionCallBack object>)
106 raise ValueError('n_jobs == 0 in Parallel has no meaning')
107 return 1
108
109 def apply_async(self, func, callback=None):
110 """Schedule a func to be run"""
--> 111 result = ImmediateResult(func)
result = undefined
func = <sklearn.externals.joblib.parallel.BatchedCalls object>
112 if callback:
113 callback(result)
114 return result
115

...........................................................................
anaconda2/lib/python2.7/site-packages/sklearn/externals/joblib/_parallel_backends.py in __init__(self=<sklearn.externals.joblib._parallel_backends.ImmediateResult object>, batch=<sklearn.externals.joblib.parallel.BatchedCalls object>)
327
328 class ImmediateResult(object):
329 def __init__(self, batch):
330 # Don't delay the application, to avoid keeping the input
331 # arguments in memory
--> 332 self.results = batch()
self.results = undefined
batch = <sklearn.externals.joblib.parallel.BatchedCalls object>
333
334 def get(self):
335 return self.results
336

...........................................................................
anaconda2/lib/python2.7/site-packages/sklearn/externals/joblib/parallel.py in __call__(self=<sklearn.externals.joblib.parallel.BatchedCalls object>)
126 def __init__(self, iterator_slice):
127 self.items = list(iterator_slice)
128 self._size = len(self.items)
129
130 def __call__(self):
--> 131 return [func(*args, **kwargs) for func, args, kwargs in self.items]
func =
args = (ExtraTreeClassifier(class_weight=None, criterion...dom_state=209652396,
splitter='random'), ExtraTreesClassifier(bootstrap=True, class_weigh...lse, random_state=0, verbose=0, warm_start=False), array([[ 1.05619999e-02, 9.99999997e-07, 5.....00000000e+00, 1.00000000e+00]], dtype=float32), array([[ 0.],
[ 0.],
[ 0.],
...,
[ 2.],
[ 2.],
[ 2.]]), None, 0, 100)
kwargs = {'class_weight': None, 'verbose': 0}
self.items = [(, (ExtraTreeClassifier(class_weight=None, criterion...dom_state=209652396,
splitter='random'), ExtraTreesClassifier(bootstrap=True, class_weigh...lse, random_state=0, verbose=0, warm_start=False), array([[ 1.05619999e-02, 9.99999997e-07, 5.....00000000e+00, 1.00000000e+00]], dtype=float32), array([[ 0.],
[ 0.],
[ 0.],
...,
[ 2.],
[ 2.],
[ 2.]]), None, 0, 100), {'class_weight': None, 'verbose': 0})]
132
133 def __len__(self):
134 return self._size
135

...........................................................................
anaconda2/lib/python2.7/site-packages/sklearn/ensemble/forest.py in _parallel_build_trees(tree=ExtraTreeClassifier(class_weight=None, criterion...dom_state=209652396,
splitter='random'), forest=ExtraTreesClassifier(bootstrap=True, class_weigh...lse, random_state=0, verbose=0, warm_start=False), X=array([[ 1.05619999e-02, 9.99999997e-07, 5.....00000000e+00, 1.00000000e+00]], dtype=float32), y=array([[ 0.],
[ 0.],
[ 0.],
...,
[ 2.],
[ 2.],
[ 2.]]), sample_weight=None, tree_idx=0, n_trees=100, verbose=0, class_weight=None)
116 warnings.simplefilter('ignore', DeprecationWarning)
117 curr_sample_weight *= compute_sample_weight('auto', y, indices)
118 elif class_weight == 'balanced_subsample':
119 curr_sample_weight *= compute_sample_weight('balanced', y, indices)
120
--> 121 tree.fit(X, y, sample_weight=curr_sample_weight, check_input=False)
tree.fit = <bound method ExtraTreeClassifier.fit of ExtraTr...om_state=209652396,
splitter='random')>
X = array([[ 1.05619999e-02, 9.99999997e-07, 5.....00000000e+00, 1.00000000e+00]], dtype=float32)
y = array([[ 0.],
[ 0.],
[ 0.],
...,
[ 2.],
[ 2.],
[ 2.]])
sample_weight = None
curr_sample_weight = array([ 0., 0., 1., ..., 0., 1., 0.])
122 else:
123 tree.fit(X, y, sample_weight=sample_weight, check_input=False)
124
125 return tree

...........................................................................
anaconda2/lib/python2.7/site-packages/sklearn/tree/tree.py in fit(self=ExtraTreeClassifier(class_weight=None, criterion...dom_state=209652396,
splitter='random'), X=array([[ 1.05619999e-02, 9.99999997e-07, 5.....00000000e+00, 1.00000000e+00]], dtype=float32), y=array([[ 0.],
[ 0.],
[ 0.],
...,
[ 2.],
[ 2.],
[ 2.]]), sample_weight=array([ 0., 0., 1., ..., 0., 1., 0.]), check_input=False, X_idx_sorted=None)
785
786 super(DecisionTreeClassifier, self).fit(
787 X, y,
788 sample_weight=sample_weight,
789 check_input=check_input,
--> 790 X_idx_sorted=X_idx_sorted)
X_idx_sorted = None
791 return self
792
793 def predict_proba(self, X, check_input=True):
794 """Predict class probabilities of the input samples X.

...........................................................................
anaconda2/lib/python2.7/site-packages/sklearn/tree/tree.py in fit(self=ExtraTreeClassifier(class_weight=None, criterion...dom_state=209652396,
splitter='random'), X=array([[ 1.05619999e-02, 9.99999997e-07, 5.....00000000e+00, 1.00000000e+00]], dtype=float32), y=array([[ 0.],
[ 0.],
[ 0.],
...,
[ 2.],
[ 2.],
[ 2.]]), sample_weight=array([ 0., 0., 1., ..., 0., 1., 0.]), check_input=False, X_idx_sorted=None)
189 if isinstance(self.min_samples_split, (numbers.Integral, np.integer)):
190 if not 2 <= self.min_samples_split:
191 raise ValueError("min_samples_split must be an integer "
192 "greater than 1 or a float in (0.0, 1.0]; "
193 "got the integer %s"
--> 194 % self.min_samples_split)
self.min_samples_split = 1
195 min_samples_split = self.min_samples_split
196 else: # float
197 if not 0. < self.min_samples_split <= 1.:
198 raise ValueError("min_samples_split must be an integer "

ValueError: min_samples_split must be an integer greater than 1 or a float in (0.0, 1.0]; got the integer 1
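
The final line of the traceback points at the cause: the parameter grid includes `min_samples_split=1`, which the installed scikit-learn rejects for integer values. A minimal sketch of a corrected grid follows, assuming the grid printed in the traceback is the one defined as `param_grid_forest` in `trainFiletClassifier.py`; `invalid_min_samples_split` is a hypothetical helper added only to illustrate the check quoted above.

```python
import numbers

# Grid copied from the traceback, with min_samples_split changed from [1, 3, 10]
# to [2, 3, 10]; 2 is the smallest integer the quoted check accepts.
param_grid_forest = {
    "bootstrap": [True, False],
    "criterion": ["gini", "entropy"],
    "max_depth": [3, 10, None],
    "max_features": [1, 3, 4, 22],
    "min_samples_leaf": [1, 3, 10],
    "min_samples_split": [2, 3, 10],
}


def invalid_min_samples_split(grid):
    """Return the grid values that would fail the check shown in the traceback."""
    bad = []
    for value in grid.get("min_samples_split", []):
        if isinstance(value, numbers.Integral):
            if value < 2:          # integers must be at least 2
                bad.append(value)
        elif not 0.0 < value <= 1.0:  # floats must lie in (0.0, 1.0]
            bad.append(value)
    return bad


print(invalid_min_samples_split(param_grid_forest))  # prints []
```

The DeprecationWarnings at the top of the log are a separate matter: `sklearn.grid_search` and `sklearn.cross_validation` were removed in scikit-learn 0.20, so with newer releases `GridSearchCV` has to be imported from `sklearn.model_selection`, where results are exposed as `cv_results_` rather than `grid_scores_`.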

class label

Hello, I don't understand the meaning of the class labels. I found what appears to be an error in the example file, which reads: "Class 0 means no introgression. Class 1 is introgression from population 2 into population 1, and class 2 is introgression from population 2 into population 1. (Class -1 is described above.)" Classes 1 and 2 are described identically.

Groups of fasta input files

Hello, there are a few things I don't quite understand about this software, and I'd appreciate some answers!
I used the example files in the package for training, and from step 5 onward I am analyzing my own data. How do I obtain the fasta files for pop1 and pop2? Should they be split by chromosome after converting the VCF to a pseudogenome?
Thank you very much for your help.

install question

Hi, when I ran "make" to compile FILET, I got the following error:

/usr/bin/ld: cannot find -lgsl
/usr/bin/ld: cannot find -lgslcblas
collect2: error: ld returned 1 exit status
make: *** [msMaskAllRows] Error 1
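
These linker errors mean that `ld` could not locate the GNU Scientific Library (`libgsl` and `libgslcblas`). Installing the GSL development package for the system (for example `libgsl-dev` on Debian or Ubuntu, or `gsl-devel` on Fedora) usually resolves this. As a quick diagnostic, the sketch below asks Python's standard `ctypes.util.find_library` whether either library is visible; `find_gsl_libraries` is a hypothetical helper name.

```python
from ctypes.util import find_library


def find_gsl_libraries():
    """Map each library named in the linker error to the name found, or None."""
    return {name: find_library(name) for name in ("gsl", "gslcblas")}


for name, found in sorted(find_gsl_libraries().items()):
    print("%s: %s" % (name, found if found else "NOT FOUND"))
```

Note that `find_library` searches roughly the way the runtime loader does. The linker also needs the development files, so a hit here does not guarantee that `make` will succeed, but "NOT FOUND" for both is a strong hint that GSL is not installed at all.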

condition on migration option produces inf values

Hi @andrewkern,
I was working with msmove to create training simulations. When I use the '-c' option in twoPopnStats_forML to condition on migration (//asterisk in msmove) the values of dd1 and dd2 are inf or very large.

Below are examples from the trainingSims directory:
twoPopnStats_forML -c 20 14 < mig21.msOut
dd1 = inf
dd2 = 4775280.894101

twoPopnStats_forML 20 14 < mig21.msOut
dd1 = 0.817502
dd2 = 1.247589

thanks,
@stsmall
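
Until the cause of the `inf` values is understood, it may help to flag non-finite statistics before they reach the training set. This is a minimal sketch, assuming the statistics have already been parsed into name/value pairs; parsing of the `twoPopnStats_forML` output is not shown, and `nonfinite_stats` is a hypothetical helper.

```python
import math


def nonfinite_stats(stats):
    """Return the names of statistics whose value is inf, -inf, or NaN."""
    return [name for name, value in stats.items()
            if not math.isfinite(float(value))]


# Values quoted in the report above, with and without the -c option.
with_c = {"dd1": "inf", "dd2": "4775280.894101"}
without_c = {"dd1": "0.817502", "dd2": "1.247589"}

print(nonfinite_stats(with_c))     # ['dd1']
print(nonfinite_stats(without_c))  # []
```

This catches only `inf` and NaN. A very large but finite value such as the `dd2` above passes the check and would need a separate range test.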

link to manuscript?

In the README it says:

The manuscript describing FILET will be posted on biorxiv shortly, at which point I will add the link to this README.

... heh =)

Mask File Example Format

Hi @andrewkern, @dschride,
I have already masked repeats and low-confidence calls in my fasta files and would also like to include a mask file with the training data. I am a little unclear about the format of this file. Could you provide a brief example, please?
thanks,
@stsmall
